[jira] [Commented] (MESOS-1973) Slaves DoS master on re-registration

Benjamin Mahler (JIRA) Mon, 27 Oct 2014 11:27:59 -0700

    [ https://issues.apache.org/jira/browse/MESOS-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185560#comment-14185560 ]


Benjamin Mahler commented on MESOS-1973:
----------------------------------------

OK, although IMO this is still important: without this patch, the messages 
in the master's queue may be quite large.

For example, if 20,000 slaves each send 10MB of {{data}} and {{message}} 
information in the task history during re-registration, the master may wind 
up storing 200GB of messages in its queue during re-registration.
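
The arithmetic behind that estimate, as a rough sketch. It assumes decimal 
units (1MB = 10^6 bytes) and the worst case where every slave's message is 
queued at the same time; the function name is made up for illustration and 
is not part of Mesos.

```python
MB = 10**6  # decimal megabyte (assumption)
GB = 10**9  # decimal gigabyte (assumption)


def queued_bytes(num_slaves, bytes_per_slave):
    """Worst-case bytes held in the master's queue if every slave's
    re-registration message is queued simultaneously."""
    return num_slaves * bytes_per_slave


total = queued_bytes(20_000, 10 * MB)
print(total / GB)  # 200.0
```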

> Slaves DoS master on re-registration
> ------------------------------------
>
>                 Key: MESOS-1973
>                 URL: https://issues.apache.org/jira/browse/MESOS-1973
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.1
>            Reporter: Brenden Matthews
>            Priority: Critical
>         Attachments: master-fail.log.gz
>
>
> Recently we've noticed a problem where the master fails over and gets DoS'd 
> by the slaves during re-registration.  This is caused by a large swath of 
> "Possibly orphaned completed task ..." log messages in the master.
> After several hundred of these re-registrations, the master balloons and then 
> gets OOM killed by the OS.
> The temporary fix is to stop all the slaves, let a master get elected as 
> leader, and then do a slow rolling restart of the slaves (i.e., start one 
> slave every 500ms).
> The fix might be to include an exponential backoff during slave 
> re-registration.
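
The exponential backoff suggested in the description could look roughly like 
the sketch below. This is not Mesos code: the function name, the 0.5s base 
(borrowed from the 500ms rolling-restart interval above), the doubling 
factor, and the 60s cap are all assumptions chosen for illustration. Random 
jitter is added so that slaves do not retry in lockstep.

```python
import random


def backoff_delays(base_s=0.5, factor=2.0, max_s=60.0, attempts=6, rng=None):
    """Return the delay (in seconds) to wait before each re-registration
    attempt. All parameters are hypothetical defaults."""
    rng = rng or random.Random()
    delays = []
    ceiling = base_s
    for _ in range(attempts):
        # Pick a delay uniformly in [0, ceiling] so that many slaves
        # retrying at once spread their messages over the window.
        delays.append(rng.uniform(0.0, ceiling))
        # Grow the window exponentially, but never beyond the cap.
        ceiling = min(ceiling * factor, max_s)
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print("attempt %d: wait up to %.2fs" % (attempt, delay))
```

Each slave would sleep for the next delay before re-sending its 
re-registration message, which spreads the load on a newly elected master 
instead of delivering it all at once.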



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
