[ https://issues.apache.org/jira/browse/MESOS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182365#comment-14182365 ]
Brenden Matthews commented on MESOS-1973:
-----------------------------------------
The master is OOM-killed at 17:46:37.700026 in the log, after which you can
see it restart. It reaches ~60G of memory before being killed.
After examining the code, I see that there is indeed an exponential backoff,
with a maximum delay of 1 minute. However, I think it's possible that
re-registration completes for many of the slaves before all of them have had a
chance to re-register, and the cycle repeats. Or at least, that's what appears
to be happening.
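For illustration only, here is a minimal sketch of a capped exponential
backoff of the kind described above. It is not the Mesos code: the names
`backoff_ceiling`, `backoff_delay` and `MAX_DELAY_SECS` are hypothetical, and
the doubling-plus-random-jitter scheme is an assumption. It shows that once
the ceiling hits the 1 minute cap, every retry from every slave is drawn from
the same 60 second window.

```python
import random

# Assumed cap, matching the "maximum delay of 1 minute" mentioned above.
MAX_DELAY_SECS = 60.0


def backoff_ceiling(attempt, factor_secs):
    """Upper bound of the delay for a zero-based retry attempt."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    return min(factor_secs * (2 ** exponent), MAX_DELAY_SECS)


def backoff_delay(attempt, factor_secs, rng=random):
    """Pick a random delay in [0, ceiling] to spread retries out."""
    return rng.uniform(0.0, backoff_ceiling(attempt, factor_secs))


if __name__ == "__main__":
    # With a 1 second factor the ceiling doubles until it reaches the cap.
    for attempt in range(8):
        print(attempt, backoff_ceiling(attempt, 1.0))
```

With a small factor and a large number of slaves, the expected number of
re-registrations per second stays high even at the cap, which would be
consistent with the behaviour seen in the log.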
A possible workaround would be to increase the slave's
`flags.registration_backoff_factor` so that re-registration attempts are
spread over a longer period.
Another option might be to maintain some state within the slave to determine
the backoff, rather than passing it through the function call stack.
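As a rough sketch of that second option, the backoff could live in a small
object owned by the slave instead of being passed along as an argument on each
retry. Everything here is hypothetical (the class `RegistrationBackoff`, its
methods, and the 60 second default cap); it only illustrates the idea and does
not reflect how the slave is actually structured.

```python
import random


class RegistrationBackoff:
    """Holds retry state so callers do not have to pass it along."""

    def __init__(self, factor_secs, max_secs=60.0, rng=None):
        self.factor_secs = factor_secs
        self.max_secs = max_secs
        self.rng = rng or random.Random()
        self.attempts = 0

    def next_delay(self):
        """Return the delay before the next attempt and record the attempt."""
        # Clamp the exponent so long retry runs cannot overflow a float.
        exponent = min(self.attempts, 32)
        ceiling = min(self.factor_secs * (2 ** exponent), self.max_secs)
        self.attempts += 1
        return self.rng.uniform(0.0, ceiling)

    def reset(self):
        """Call once (re-)registration has been acknowledged."""
        self.attempts = 0


if __name__ == "__main__":
    backoff = RegistrationBackoff(factor_secs=1.0)
    for _ in range(5):
        print(round(backoff.next_delay(), 3))
    backoff.reset()
```

Keeping the attempt count in one place means it survives across independent
retry paths and is only cleared by an explicit `reset()`.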
> Slaves DoS master on re-registration
> ------------------------------------
>
> Key: MESOS-1973
> URL: https://issues.apache.org/jira/browse/MESOS-1973
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.20.1
> Reporter: Brenden Matthews
> Priority: Critical
> Attachments: master-fail.log.gz
>
>
> Recently we've noticed a problem where the master fails over and gets DoS'd
> by the slaves during re-registration. This is caused by a large swath of
> "Possibly orphaned completed task ..." log messages in the master.
> After several hundred of these re-registrations, the master balloons and then
> gets OOM killed by the OS.
> The temporary fix is to stop all the slaves, let a master get elected as
> leader, and then do a slow rolling restart of the slaves (i.e., start one
> slave every 500ms).
> The fix might be to include an exponential backoff during slave
> re-registration.