[
https://issues.apache.org/jira/browse/MESOS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182289#comment-14182289
]
Benjamin Mahler commented on MESOS-1973:
----------------------------------------
Slaves do use exponential backoff to re-register with the master.
The log you posted is a strange slice of time; could we see the log from when
the master is elected to when the OOM occurs? It looks like it starts in the
middle of the recovery run:
{noformat}
W1023 17:46:33.796649 40851 master.cpp:4203] Possibly orphaned completed task
23899 of framework 20141013-220327-3434197514-31806-78317-0339 that ran on
slave 20140926-142803-3852091146-5050-3487-229 at slave(1)@10.102.191.208:5051
(i-dfb82b32.inst.aws.airbnb.com)
...
I1023 17:46:42.536113 41161 main.cpp:157] Version: 0.20.1
...
{noformat}
Do you know how much memory it was consuming?
> Slaves DoS master on re-registration
> ------------------------------------
>
> Key: MESOS-1973
> URL: https://issues.apache.org/jira/browse/MESOS-1973
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.20.1
> Reporter: Brenden Matthews
> Priority: Critical
> Attachments: master-fail.log.gz
>
>
> Recently we've noticed a problem where the master fails over and gets DoS'd
> by the slaves during re-registration, as evidenced by a flood of
> "Possibly orphaned completed task ..." log messages in the master.
> After several hundred of these re-registrations, the master's memory usage
> balloons and it gets OOM-killed by the OS.
> The temporary fix is to stop all the slaves, let a master get elected as
> leader, and then do a slow rolling restart of the slaves (i.e., start one
> slave every 500ms).
> The fix might be to include an exponential backoff during slave
> re-registration.
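The temporary workaround described above (restart one slave every 500ms) can be sketched as a small rolling-restart helper. This is an illustrative sketch, not part of Mesos; `restart` stands in for whatever command your deployment uses to restart a slave (e.g. via your init system), and the 500ms interval matches the report above.

```python
import time

def rolling_restart(hosts, restart, interval=0.5, sleep=time.sleep):
    """Restart each slave in turn, pausing `interval` seconds between
    restarts so the master is not hit by all re-registrations at once."""
    for host in hosts:
        restart(host)     # e.g. invoke the init system on that host
        sleep(interval)   # 0.5s gap, per the workaround in the report
```

Injecting `restart` and `sleep` keeps the pacing logic separate from the host-specific restart mechanism.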
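The suggested fix — exponential backoff on slave re-registration — typically looks like a capped, jittered retry schedule. The sketch below is an illustrative assumption (function name and constants are not Mesos's actual implementation, which, per the comment above, already applies backoff):

```python
import random

def backoff_delays(initial=1.0, factor=2.0, cap=60.0, attempts=6):
    """Yield a capped, jittered exponential backoff schedule in seconds."""
    delay = initial
    for _ in range(attempts):
        # Full jitter: sample uniformly in [0, delay] so that many slaves
        # re-registering at once spread their retries instead of stampeding.
        yield random.uniform(0, delay)
        delay = min(delay * factor, cap)
```

The jitter is the important part for this bug: without it, all slaves back off and retry in lockstep, and the thundering herd simply recurs at each interval.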
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)