[
https://issues.apache.org/jira/browse/MESOS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182289#comment-14182289
]
Benjamin Mahler commented on MESOS-1973:
----------------------------------------
Slaves do use exponential backoff to re-register with the master.
The log you posted is a strange slice of time; could we see the log from when
the master is elected to when the OOM occurs? It looks like it starts in the
middle of the recovery run:
{noformat}
W1023 17:46:33.796649 40851 master.cpp:4203] Possibly orphaned completed task
23899 of framework 20141013-220327-3434197514-31806-78317-0339 that ran on
slave 20140926-142803-3852091146-5050-3487-229 at slave(1)@10.102.191.208:5051
(i-dfb82b32.inst.aws.airbnb.com)
...
I1023 17:46:42.536113 41161 main.cpp:157] Version: 0.20.1
...
{noformat}
Do you know how much memory it was consuming?
> Slaves DoS master on re-registration
> ------------------------------------
>
> Key: MESOS-1973
> URL: https://issues.apache.org/jira/browse/MESOS-1973
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.20.1
> Reporter: Brenden Matthews
> Priority: Critical
> Attachments: master-fail.log.gz
>
>
> Recently we've noticed a problem where the master fails over and gets DoS'd
> by the slaves during re-registration, as evidenced by a flood of
> "Possibly orphaned completed task ..." log messages in the master.
> After several hundred of these re-registrations, the master's memory usage
> balloons and it gets OOM-killed by the OS.
> The temporary fix is to stop all the slaves, let a master get elected as
> leader, and then do a slow rolling restart of the slaves (i.e., start one
> slave every 500ms).
> The fix might be to include an exponential backoff during slave
> re-registration.
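The temporary workaround described above (restart one slave every 500ms) can be sketched as a small rolling-restart helper. This is an illustrative sketch, not part of Mesos; `restart` stands in for whatever command your deployment uses to restart a slave (e.g. via your init system), and the 500ms interval matches the report above.

```python
import time

def rolling_restart(hosts, restart, interval=0.5, sleep=time.sleep):
    """Restart each slave in turn, pausing `interval` seconds between
    restarts so the master is not hit by all re-registrations at once."""
    for host in hosts:
        restart(host)     # e.g. invoke the init system on that host
        sleep(interval)   # 0.5s gap, per the workaround in the report
```

Injecting `restart` and `sleep` keeps the pacing logic separate from the host-specific restart mechanism.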
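The suggested fix — exponential backoff on slave re-registration — typically looks like a capped, jittered retry schedule. The sketch below is an illustrative assumption (function name and constants are not Mesos's actual implementation, which, per the comment above, already applies backoff):

```python
import random

def backoff_delays(initial=1.0, factor=2.0, cap=60.0, attempts=6):
    """Yield a capped, jittered exponential backoff schedule in seconds."""
    delay = initial
    for _ in range(attempts):
        # Full jitter: sample uniformly in [0, delay] so that many slaves
        # re-registering at once spread their retries instead of stampeding.
        yield random.uniform(0, delay)
        delay = min(delay * factor, cap)
```

The jitter is the important part for this bug: without it, all slaves back off and retry in lockstep, and the thundering herd simply recurs at each interval.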
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)