Zach Carlson created MESOS-2122:
-----------------------------------

             Summary: MesosSchedulerDriver stop causes resource offer exhaustion
                 Key: MESOS-2122
                 URL: https://issues.apache.org/jira/browse/MESOS-2122
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.20.0, 0.21.0
         Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
            Reporter: Zach Carlson


For additional consideration, see https://github.com/airbnb/chronos/issues/290 
and https://github.com/mesosphere/marathon/issues/787

When the SchedulerProcess managed by the MesosSchedulerDriver detects a master, 
it performs a link() to the master. Libprocess proceeds to establish the link. 
Once the scheduler has performed all the work necessary, it may call 
MesosSchedulerDriver.stop(failover = true). 

This is where things go awry: at this point, the SchedulerProcess schedules a 
termination event for itself. When libprocess's schedule thread rolls through, 
it performs a cleanup() of the SchedulerProcess, as expected. Part of the 
cleanup() is calling SocketManager::exited() on the SchedulerProcess. The 
problem with this is that SocketManager::exited() cleans up the links from the 
link map, but does not actually close the sockets. Because 
MesosSchedulerDriver::stop() was called with failover = true, no 
DeregisterFramework message was sent, so the Mesos master believes that the 
connection (which is still open) belongs to a registered framework that is 
still listening for events. It sends resourceOffers to the 'valid' framework, 
but since nothing is actually listening, no response is sent and the offers 
are neither accepted nor declined. Mesos then grinds to a halt: no further 
offers are made to any framework, and once all current framework work has 
completed, no further work is performed because the offers are wasted. This 
holds until version 0.21.0, which (according to the release notes) will 
rescind unanswered offers after a configurable timeout. 
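
The sequence above can be illustrated with a small toy model. This is only a 
sketch of the described behaviour, not libprocess or Mesos code: 
ToySocketManager, ToyMaster, simulate() and the close_unused_sockets flag are 
all hypothetical names, and the model assumes the master treats any open 
connection as a live, registered framework. 

```python
# Toy model only: every class, method and flag here is hypothetical and does
# not correspond to the real libprocess or Mesos APIs.
from dataclasses import dataclass, field


@dataclass
class ToySocketManager:
    """Tracks link bookkeeping separately from the sockets themselves."""
    links: dict = field(default_factory=dict)        # peer -> set of linked process ids
    open_sockets: set = field(default_factory=set)   # peers with an open connection

    def link(self, pid, peer):
        # Record the link and open a connection to the peer.
        self.links.setdefault(peer, set()).add(pid)
        self.open_sockets.add(peer)

    def exited(self, pid, close_unused_sockets=False):
        # As described in the report: the link entries for pid are removed.
        # The socket is only closed when the (hypothetical) flag is set.
        for peer in list(self.links):
            self.links[peer].discard(pid)
            if not self.links[peer]:
                del self.links[peer]
                if close_unused_sockets:
                    self.open_sockets.discard(peer)


@dataclass
class ToyMaster:
    cpus_free: int
    outstanding: dict = field(default_factory=dict)  # framework -> cpus offered

    def offer_all(self, framework, connection_open):
        # Assumption: an open connection is taken to mean a live framework.
        if connection_open and self.cpus_free > 0:
            self.outstanding[framework] = (
                self.outstanding.get(framework, 0) + self.cpus_free
            )
            self.cpus_free = 0


def simulate(close_unused_sockets):
    manager = ToySocketManager()
    master = ToyMaster(cpus_free=8)
    manager.link("scheduler", "master")
    # stop(failover = true): the process is cleaned up, nothing is deregistered.
    manager.exited("scheduler", close_unused_sockets)
    master.offer_all("stopped-framework", "master" in manager.open_sockets)
    return master.cpus_free, master.outstanding
```

In this model, simulate(False) leaves the master with no free CPUs and all 8 
tied up in an offer to the stopped framework, while simulate(True) keeps all 8 
CPUs available because the connection is gone before the next offer cycle. 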



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
