[
https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zach Carlson updated MESOS-2122:
--------------------------------
Attachment: mesos_2122.py
I can reliably reproduce this bug with the attached Python script pointed at
the Mesos master:

    python ./mesos_2122.py zk://10.0.0.10:2181,10.0.0.11:2181,10.0.0.12:2181/mesos

> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
> Key: MESOS-2122
> URL: https://issues.apache.org/jira/browse/MESOS-2122
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.20.0, 0.21.0, 0.20.1
> Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
> Reporter: Zach Carlson
> Attachments: mesos_2122.py
>
>
> For additional consideration, see
> https://github.com/airbnb/chronos/issues/290 and
> https://github.com/mesosphere/marathon/issues/787
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a
> master, it performs a link() to the master. Libprocess proceeds to establish
> the link. Once the scheduler has performed all the work necessary, it may
> call MesosSchedulerDriver.stop(failover = true).
> This is where things go awry: at this point, the SchedulerProcess schedules a
> termination event for itself. When libprocess's schedule thread rolls
> through, it performs a cleanup() of the SchedulerProcess, as expected. Part
> of the cleanup() is calling SocketManager::exited() on the SchedulerProcess.
> The problem is that SocketManager::exited() removes the links from the link
> map but does not actually close the sockets. Because
> MesosSchedulerDriver::stop() was called with failover = true, no
> DeregisterFramework message was sent, so the Mesos master believes that the
> connection (which is still open) belongs to a registered framework that is
> listening for events. It sends resourceOffers to the 'valid' framework, and
> since nothing is actually listening for events, no response is sent and no
> offers are accepted or declined. Mesos grinds to a halt: no further offers
> are made to any framework, and once all current framework work has
> completed, no further work is performed because the offers are wasted.
> (This holds until version 0.21.0, which, according to the release notes,
> will rescind unanswered offers after a configurable timeout.)
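
The failure mode described above can be illustrated with a toy model using
plain TCP sockets. This is only a sketch under stated assumptions: it is not
libprocess or Mesos code, and the names run_scenario, master_sees_disconnect,
and links are hypothetical. It shows that dropping a link from a bookkeeping
map without closing the socket leaves the peer with no way to notice that
nobody is listening, so the peer keeps sending.

```python
import socket


def master_sees_disconnect(conn, wait=0.2):
    """Return True if the peer has closed the connection, False if it still looks open."""
    conn.settimeout(wait)
    try:
        # An empty read means the peer closed its end (EOF).
        return conn.recv(1) == b""
    except socket.timeout:
        # No data and no EOF: from this side the connection still looks alive.
        return False
    except ConnectionError:
        # The peer reset the connection.
        return True


def run_scenario(close_socket):
    """Toy model (hypothetical): returns (master_saw_disconnect, offer_sent)."""
    # "Master" side: accept a single connection on an ephemeral loopback port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    # "Scheduler" side: connect and record the link in a bookkeeping map.
    sched_sock = socket.create_connection(listener.getsockname())
    master_conn, _ = listener.accept()
    links = {"master": sched_sock}

    # Cleanup step: the link is always forgotten, but the socket is only
    # closed when close_socket is True. Keeping the object referenced
    # models a file descriptor that stays open in the process.
    leaked = links.pop("master")
    if close_socket:
        leaked.close()

    disconnected = master_sees_disconnect(master_conn)
    offer_sent = False
    if not disconnected:
        # The send succeeds even though nothing will ever read or answer it.
        master_conn.sendall(b"resourceOffers")
        offer_sent = True

    # Release everything so the demo itself does not leak sockets.
    leaked.close()
    master_conn.close()
    listener.close()
    return disconnected, offer_sent


if __name__ == "__main__":
    print("link removed, socket left open:", run_scenario(close_socket=False))
    print("link removed, socket closed:   ", run_scenario(close_socket=True))
```

In the first scenario the "master" sees no disconnect and sends its offer
into the void; in the second, closing the socket lets it detect the departure
and skip the offer.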
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)