[
https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218426#comment-14218426
]
Benjamin Mahler commented on MESOS-2122:
----------------------------------------
Hm, I can't seem to dig up the ticket related to this. It's a long-standing
limitation in libprocess: we don't notify on termination when libprocess stays
up but a Process terminates:
https://github.com/apache/mesos/blob/0.21.0/3rdparty/libprocess/TODO#L9
To my knowledge, we've been skirting this issue because, for the most part,
frameworks are failing over when they call stop(failover=true). At that
point, a new instantiation of the framework re-registers and we treat
the old one as having gone away.
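As a rough illustration of why failover tends to mask the problem, here is a toy model in Python. It assumes the master tracks one connection per framework ID and overwrites it on re-registration; that is an assumption made for the sketch, not something taken from the Mesos source, and ToyMaster, register, and offer_targets are hypothetical names.

```python
class ToyMaster:
    """Toy stand-in for a master's framework registry (not Mesos code)."""

    def __init__(self):
        # framework id -> label of the connection offers are sent to
        self.frameworks = {}

    def register(self, framework_id, connection):
        # Assumption: re-registering under the same id replaces the old
        # connection, so the stale one stops receiving offers.
        previous = self.frameworks.get(framework_id)
        self.frameworks[framework_id] = connection
        return previous

    def offer_targets(self):
        # Connections that would currently be sent resource offers.
        return sorted(self.frameworks.values())


master = ToyMaster()
master.register("fw-1", "conn-old")             # original scheduler instance
replaced = master.register("fw-1", "conn-new")  # new instance after failover
```

In this model the stale connection only stops receiving offers because a new instance takes over its ID; if no new instance ever registers, "conn-old" stays in the registry.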
> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
> Key: MESOS-2122
> URL: https://issues.apache.org/jira/browse/MESOS-2122
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.20.0, 0.20.1
> Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
> Reporter: Zach Carlson
> Attachments: mesos_2122.py
>
>
> For additional consideration, see
> https://github.com/airbnb/chronos/issues/290 and
> https://github.com/mesosphere/marathon/issues/787
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a
> master, it performs a link() to the master. Libprocess proceeds to establish
> the link. Once the scheduler has performed all the work necessary, it may
> call MesosSchedulerDriver.stop(failover = true).
> This is where things go awry: at this point, the SchedulerProcess schedules a
> termination event for itself. When libprocess's schedule thread rolls
> through, it performs a cleanup() of the SchedulerProcess, as expected. Part
> of the cleanup() is calling SocketManager::exited() on the SchedulerProcess.
> The problem with this is that SocketManager::exited() cleans up the links
> from the link map, but does not actually close the sockets. Now, since
> MesosSchedulerDriver::stop() was called with failover = true, no
> DeregisterFramework message was sent, so the Mesos master believes that the
> connection (which is still active) is still valid with a registered framework
> listening for events. It sends resourceOffers to the 'valid' framework, and
> since nothing is actually listening for events, no response is sent and no
> offers are accepted or declined. Mesos then grinds to a halt: no further
> offers are made to any framework, and once all current framework work has
> completed, no further work is performed because the offers are wasted.
> (According to the release notes, version 0.21.0 will rescind unanswered
> offers after a configurable timeout.)
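The sequence in the quoted report can be illustrated with a small toy model in Python. This is only a sketch of the bookkeeping as described there, not libprocess code: ToySocketManager and its methods are hypothetical names, and exited_and_close is one possible alternative behavior, not a proposed patch.

```python
class ToySocketManager:
    """Toy stand-in for the link bookkeeping described in the report."""

    def __init__(self):
        self.links = {}            # local process id -> set of linked remotes
        self.open_sockets = set()  # remotes with a live connection

    def link(self, pid, remote):
        # Record the link and open a connection to the remote.
        self.links.setdefault(pid, set()).add(remote)
        self.open_sockets.add(remote)

    def exited(self, pid):
        # Reported behavior: the link map is cleaned up, but the socket
        # stays open, so the remote still sees a live connection.
        self.links.pop(pid, None)

    def exited_and_close(self, pid):
        # Hypothetical alternative: also close sockets that no remaining
        # local process is linked to.
        remotes = self.links.pop(pid, set())
        still_linked = set().union(*self.links.values())
        self.open_sockets -= remotes - still_linked


manager = ToySocketManager()
manager.link("scheduler", "master:5050")
manager.exited("scheduler")
# The link is gone, yet the connection to the master is still open.
```

With exited(), the master's side of the connection looks healthy even though nothing is listening, which matches the stalled-offer symptom described in the report.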