[jira] [Created] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

Yan Xu (JIRA) Fri, 02 Nov 2018 11:02:07 -0700

Yan Xu created MESOS-9368:
-----------------------------

             Summary: The agent can be resending status updates too 
aggressively and the backoff is not configurable
                 Key: MESOS-9368
                 URL: https://issues.apache.org/jira/browse/MESOS-9368
             Project: Mesos
          Issue Type: Bug
            Reporter: Yan Xu



The current behavior is that the agent queues status updates in a "stream", 
which is retried with an exponential backoff from 10 secs to 10 mins. On each 
retry only the front of the queue is sent, so if multiple statuses are queued 
up, subsequent ones are not attempted until the first one is acked. So if the 
frameworks are for some reason not able to ack at all, there is one update per 
task in flight at a time.
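
To make the queueing behavior concrete, here is a minimal sketch of one such 
stream. It is not the agent's actual code: the class and method names are 
hypothetical, and resetting the backoff to the minimum after an ack is an 
assumption made for illustration.

```python
from collections import deque


class UpdateStream:
    """Hypothetical model of one task's status update stream."""

    def __init__(self, min_backoff=10, max_backoff=600):
        self.pending = deque()
        self.min_backoff = min_backoff
        self.max_backoff = max_backoff
        self.backoff = min_backoff  # seconds until the next retry

    def enqueue(self, update):
        self.pending.append(update)

    def next_to_send(self):
        # Only the front of the queue is ever in flight.
        return self.pending[0] if self.pending else None

    def on_timeout(self):
        # No ack arrived: double the retry interval, capped at the maximum.
        self.backoff = min(self.backoff * 2, self.max_backoff)

    def on_ack(self):
        # The front update was acked: move on to the next one.
        # Resetting the backoff here is an assumption of this sketch.
        self.pending.popleft()
        self.backoff = self.min_backoff
```

With two updates queued and no acks, `next_to_send()` keeps returning the first 
update while the retry interval grows to the 600 sec cap.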

If in a cluster we have 500,000 tasks with pending status updates and the 
master fails over, each agent starts to send these updates as soon as it has 
reregistered. So we are looking at 500,000 updates ~immediately + 500,000 
updates 10 secs later + 500,000 updates after each subsequent backoff interval 
of 20, 40, 80, 160, 320 and 600 secs.
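
The arithmetic can be sketched as follows, assuming the retry interval doubles 
from 10 secs and is capped at 600 secs, that no update is ever acked, and that 
all agents reregister at roughly the same time. The function names are 
hypothetical and exist only for this illustration.

```python
def retry_intervals(min_backoff=10, max_backoff=600):
    """Return the retry intervals (secs) up to and including the cap."""
    intervals = []
    interval = min_backoff
    while interval < max_backoff:
        intervals.append(interval)
        interval *= 2
    intervals.append(max_backoff)
    return intervals


def messages_sent(tasks, elapsed_secs, min_backoff=10, max_backoff=600):
    """Count update messages sent within elapsed_secs if nothing is acked."""
    sends = 1  # the initial send right after reregistration
    t = 0
    interval = min_backoff
    while t + interval <= elapsed_secs:
        t += interval
        sends += 1
        interval = min(interval * 2, max_backoff)
    return tasks * sends


print(retry_intervals())           # [10, 20, 40, 80, 160, 320, 600]
print(messages_sent(500000, 630))  # 3500000
```

Under these assumptions the master receives 3.5 million update messages in the 
first 630 secs after the failover, even though no new task state is conveyed.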

Given that the initial communication of task state is covered by the agent 
reregistration message and the framework reconciliation requests, it seems that 
we can safely reduce the retry frequency further, on an opt-in basis of course. 
The backoff is not currently configurable, so we need to expose a flag for it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
