[
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xudong Ni reassigned MESOS-9368:
--------------------------------
Assignee: Xudong Ni
> The agent can be resending status updates too aggressively and the backoff is
> not configurable
> ----------------------------------------------------------------------------------------------
>
> Key: MESOS-9368
> URL: https://issues.apache.org/jira/browse/MESOS-9368
> Project: Mesos
> Issue Type: Bug
> Reporter: Yan Xu
> Assignee: Xudong Ni
> Priority: Major
>
> The current behavior is that the agent queues status updates in a
> "stream" which has an exponential backoff window from 10 secs to 10 mins. On
> each retry only the front of the queue is sent, so if multiple statuses are
> queued up, subsequent ones are not attempted until the first one is acked. So
> if the frameworks are for some reason not able to ack at all, there is only
> one update per task in flight at a time.
> If in a cluster we have 500,000 tasks with pending status updates and the
> master fails over, each agent starts to send these updates once it has
> reregistered, so we are looking at 500,000 updates ~immediately + 500,000
> updates 10 secs later + 500,000 more after each further wait of 20, 40, 80,
> 160, 320 and 600 secs.
> Given that the initial communication of task state is covered by the agent
> reregistration message and the framework reconciliation requests, it seems
> that we can safely reduce the retry frequency further, as an opt-in.
> The backoff is not currently configurable, so we need to expose a flag for it.
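
For reference, the retry arithmetic in the description can be sketched as below. This is only an illustration, not Mesos code: the function names and PENDING_TASKS are hypothetical, and it assumes the interval simply doubles from 10 secs up to the 10 min cap, with no jitter, no acks arriving, and all agents reregistering at the same instant.

```python
from itertools import accumulate


def backoff_intervals(initial_secs=10, max_secs=600):
    """Return the waits between retries: doubling from initial_secs,
    capped at max_secs (assumed schedule, no jitter)."""
    intervals = []
    interval = initial_secs
    while interval < max_secs:
        intervals.append(interval)
        interval *= 2
    intervals.append(max_secs)  # first wait that hits the cap
    return intervals


def send_times(intervals):
    """Return the offsets (secs after reregistration) at which the front
    update of a stream is sent, assuming it is never acked."""
    return [0] + list(accumulate(intervals))


PENDING_TASKS = 500_000  # figure taken from the issue description

intervals = backoff_intervals()
times = send_times(intervals)

# One update per task is in flight at a time, so each send time costs
# one message per pending task.
total_sent = PENDING_TASKS * len(times)

print("waits between retries (secs):", intervals)
print("send offsets (secs):", times)
print("messages until the cap is first reached:", total_sent)
```

Under these assumptions the master sees eight bursts of 500,000 updates within roughly the first 20 minutes after failover, and one more burst every 10 minutes after that. Passing a larger initial_secs or max_secs shows how a configurable backoff would thin this out.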
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)