Brenden Matthews created MESOS-1973:
---------------------------------------

             Summary: Slaves DoS master on re-registration
                 Key: MESOS-1973
                 URL: https://issues.apache.org/jira/browse/MESOS-1973
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 0.20.1
            Reporter: Brenden Matthews
            Priority: Critical


Recently we've noticed a problem where the master fails over and then gets 
DoS'd by the slaves during re-registration.  This is caused by a flood of 
"Possibly orphaned completed task ..." log messages in the master.

After several hundred of these re-registrations, the master's memory usage 
balloons and the process gets OOM-killed by the OS.

The temporary workaround is to stop all the slaves, let a master get elected 
as leader, and then do a slow rolling restart of the slaves (i.e., start one 
slave every 500ms).
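
For reference, the rolling restart amounts to something like the sketch below.
It is only an illustration: rolling_start and the start_slave callback are
hypothetical names, and how a slave is actually started (ssh, init script,
config management) is left to the caller.

```python
import time


def rolling_start(slaves, start_slave, interval=0.5, sleep=time.sleep):
    """Start slaves one at a time, pausing `interval` seconds between starts.

    `start_slave` is a hypothetical callback supplied by the caller; it is
    not part of Mesos.
    """
    started = []
    for index, host in enumerate(slaves):
        if index > 0:
            # Spread the re-registrations out so the newly elected master
            # handles one slave at a time.
            sleep(interval)
        start_slave(host)
        started.append(host)
    return started
```

The sleep function is a parameter only so the pacing can be checked without
actually waiting.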

A possible fix is to add exponential backoff to slave re-registration.
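
A rough sketch of the backoff idea, in Python rather than the Mesos C++ code.
The function name, the default base/factor/cap values, and the use of random
jitter are all assumptions made for illustration, not existing slave
behavior.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, rng=random.random):
    """Yield one randomized wait (in seconds) per re-registration attempt.

    All parameters are illustrative assumptions, not Mesos settings.
    """
    ceiling = base
    for _ in range(attempts):
        # Pick a wait uniformly in [0, ceiling) so that slaves which lost
        # the master at the same moment do not all retry in lockstep.
        yield rng() * ceiling
        # Double the ceiling for the next attempt, up to the cap.
        ceiling = min(cap, ceiling * factor)
```

A slave would sleep for each yielded delay before its next re-registration
attempt and stop once the master acknowledges it. The randomization matters
as much as the doubling: without it, every slave would wait the same amount
and hit the master together again.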



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
