Yan Xu created MESOS-9368:
-----------------------------
Summary: The agent can be resending status updates too
aggressively and the backoff is not configurable
Key: MESOS-9368
URL: https://issues.apache.org/jira/browse/MESOS-9368
Project: Mesos
Issue Type: Bug
Reporter: Yan Xu
Currently the agent queues status updates in a "stream", which retries with an
exponential backoff window from 10 secs to 10 mins. On each retry only the
front of the queue is sent, so if multiple statuses are queued up, subsequent
ones are not attempted until the first one is acked. So if the frameworks are
for some reason not able to ack at all, there is only one update per task in
flight at a time.
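
The queueing behavior described above can be sketched as a toy model. This is
not Mesos code: the class name StatusUpdateStream and its methods are
hypothetical, chosen only to illustrate that the front of the queue is the
only update ever sent until it is acked.

```python
from collections import deque


class StatusUpdateStream:
    """Toy model of one task's status update stream (hypothetical, not Mesos code)."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, update):
        # New updates always go to the back of the queue.
        self.pending.append(update)

    def next_to_send(self):
        # Only the front of the queue is ever sent or resent.
        return self.pending[0] if self.pending else None

    def acknowledge(self, update):
        # An ack for the front update removes it and unblocks the next one.
        if self.pending and self.pending[0] == update:
            self.pending.popleft()
            return True
        return False


stream = StatusUpdateStream()
stream.enqueue("TASK_STARTING")
stream.enqueue("TASK_RUNNING")

first_attempt = stream.next_to_send()   # front of the queue
retry_attempt = stream.next_to_send()   # still the same update: no ack yet
stream.acknowledge("TASK_STARTING")
after_ack = stream.next_to_send()       # the next update is now unblocked
```

Without an ack, every retry resends the same front update, which is why the
number of updates in flight is bounded by the number of tasks rather than by
the number of queued statuses.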
If in a cluster we have 500,000 tasks with pending status updates and the
master fails over, each agent starts to send these updates once it has
reregistered, so we are looking at 500,000 updates ~immediately + 500,000
updates 10 secs later + 500,000 more after each further backoff of 20, 40, 80,
160, 320, 600 secs.
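
A rough sketch of this arithmetic follows. It assumes that the listed values
are the intervals between consecutive retries (doubling from 10 secs and capped
at 10 mins), that no acks arrive, and that all agents start at about the same
time. The function and variable names are hypothetical, not Mesos identifiers.

```python
def retry_intervals(min_secs=10, max_secs=600):
    """Backoff intervals between sends: doubling from min_secs, capped at max_secs."""
    intervals = []
    current = min_secs
    while current < max_secs:
        intervals.append(current)
        current *= 2
    # The first interval that would exceed the cap is clamped to the cap.
    intervals.append(max_secs)
    return intervals


def send_times(intervals):
    """Cumulative seconds after reregistration at which each send happens."""
    times = [0]  # the first send happens ~immediately
    for interval in intervals:
        times.append(times[-1] + interval)
    return times


PENDING_TASKS = 500_000  # figure taken from the scenario above

intervals = retry_intervals()
times = send_times(intervals)
total_updates = PENDING_TASKS * len(times)
```

Under these assumptions the eight sends land at 0, 10, 30, 70, 150, 310, 630
and 1230 secs, i.e. 4,000,000 update messages reach the master in roughly the
first 20 minutes, with further rounds every 600 secs after that.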
Given that the initial communication of task state is covered by the agent
reregistration message and the framework reconciliation requests, it seems that
we can safely reduce the retry frequency further, as an opt-in of course. The
backoff is not currently configurable, so we need to expose a flag for it.