[ 
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Ni reassigned MESOS-9368:
--------------------------------

    Assignee: Xudong Ni

> The agent can be resending status updates too aggressively and the backoff is 
> not configurable
> ----------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9368
>                 URL: https://issues.apache.org/jira/browse/MESOS-9368
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>            Assignee: Xudong Ni
>            Priority: Major
>
> The current behavior is that the agent queues status updates in a "stream"
> which retries with an exponential backoff from 10 secs to 10 mins. On each
> retry only the front of the queue is sent, so if multiple statuses are
> queued up, subsequent ones are not attempted until the first one is acked.
> So if the frameworks are for some reason not able to ack at all, there is
> one update per task in flight at a time.
> If a cluster has 500,000 tasks with pending status updates and the master
> fails over, each agent starts to send these updates once it has
> reregistered, so we are looking at 500,000 updates ~immediately + 500,000
> updates 10 secs later + 500,000 updates after each further wait of 20, 40,
> 80, 160, 320, 600 secs.
> Given that the initial communication of task state is covered by the agent 
> reregistration message and the framework reconciliation requests, it seems 
> that we can safely reduce the retry frequency further, optionally of course. 
> It's not currently configurable so we need to expose a flag for it.
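
To make the arithmetic in the description concrete, here is a rough Python sketch of the retry schedule. It is not Mesos code: the function names are hypothetical, and it assumes the interval simply doubles from 10 secs up to a 600 secs cap, that no update is ever acked, and that all agents reregister at about the same time.

```python
from itertools import accumulate


def backoff_delays(initial_secs=10, max_secs=600, attempts=8):
    """Waits (in seconds) between consecutive send attempts of one update.

    Assumption: the wait doubles after every attempt and is capped at
    max_secs, matching the 10, 20, 40, ... 600 sequence in the description.
    """
    delays = []
    delay = initial_secs
    for _ in range(attempts - 1):
        delays.append(delay)
        delay = min(delay * 2, max_secs)
    return delays


def send_times(delays):
    """Seconds after reregistration at which each attempt is sent."""
    return list(accumulate([0] + delays))


def total_sends(pending_tasks, attempts):
    """Messages sent when one update per task is in flight and none is acked."""
    return pending_tasks * attempts


if __name__ == "__main__":
    delays = backoff_delays()
    for t in send_times(delays):
        print(f"t={t:>5}s: 500,000 updates")
    print("total:", total_sends(500_000, len(delays) + 1))
```

Under these assumptions the first four rounds (2,000,000 messages) all land within the first 70 secs after the failover, which is the burst a configurable backoff would let operators spread out.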



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
