[ 
https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218426#comment-14218426
 ] 

Benjamin Mahler commented on MESOS-2122:
----------------------------------------

Hm, I can't seem to dig up the ticket related to this. It's a long-standing 
limitation in libprocess: we don't send a termination notification when a 
Process terminates but libprocess itself stays up:
https://github.com/apache/mesos/blob/0.21.0/3rdparty/libprocess/TODO#L9

To my knowledge, we've been skirting this issue because, for the most part, 
frameworks that call stop(failover=true) are failing over. At that point, a 
new instantiation of the framework re-registers and we treat the old one as 
having gone away.

> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
>                 Key: MESOS-2122
>                 URL: https://issues.apache.org/jira/browse/MESOS-2122
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
>            Reporter: Zach Carlson
>         Attachments: mesos_2122.py
>
>
> For additional consideration, see 
> https://github.com/airbnb/chronos/issues/290 and 
> https://github.com/mesosphere/marathon/issues/787
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a 
> master, it performs a link() to the master. Libprocess proceeds to establish 
> the link. Once the scheduler has performed all the work necessary, it may 
> call MesosSchedulerDriver.stop(failover = true). 
> This is where things go awry: at this point, the SchedulerProcess schedules a 
> termination event for itself. When libprocess's schedule thread rolls 
> through, it performs a cleanup() of the SchedulerProcess, as expected. Part 
> of the cleanup() is calling SocketManager::exited() on the SchedulerProcess. 
> The problem with this is that SocketManager::exited() cleans up the links 
> from the link map, but does not actually close the sockets. Now, since 
> MesosSchedulerDriver::stop() was called with failover = true, no 
> DeregisterFramework message was sent, so the Mesos master believes that the 
> connection (which is still active) is still valid with a registered framework 
> listening for events. It sends resourceOffers to the 'valid' framework, but 
> since nothing is actually listening for events, no response is sent and no 
> offers are accepted or declined. Mesos then grinds to a halt: no further 
> offers are made to any framework, and once all current framework work has 
> completed, no further work is performed because the offers are wasted. 
> (Version 0.21.0 will, according to the release notes, rescind unanswered 
> offers after a configurable timeout.)
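
The failure mode described above can be illustrated with a small, 
self-contained toy model. This is only a sketch under stated assumptions: 
ToySocketManager, peer_sees_disconnect and demo are hypothetical names made up 
for illustration, not libprocess or Mesos APIs, and a local socket pair stands 
in for the scheduler-to-master connection. It shows that dropping a link from 
the bookkeeping map without closing the socket leaves the peer unable to 
detect that the process went away.

```python
import select
import socket


class ToySocketManager:
    """Hypothetical stand-in for libprocess's SocketManager (not the real API)."""

    def __init__(self):
        self.links = {}  # process id -> socket used for the link

    def link(self, pid, sock):
        self.links[pid] = sock

    def exited(self, pid, close_socket=False):
        # Remove the link bookkeeping. Per the report, the socket itself is
        # left open; close_socket=True models the alternative behaviour.
        sock = self.links.pop(pid, None)
        if sock is not None and close_socket:
            sock.close()


def peer_sees_disconnect(peer, timeout=0.2):
    """Return True if the peer observes EOF on its end within the timeout."""
    readable, _, _ = select.select([peer], [], [], timeout)
    return bool(readable) and peer.recv(1) == b""


def demo(close_socket):
    # scheduler_end keeps a reference so the socket is not closed by garbage
    # collection, mirroring a process that stays up after the scheduler exits.
    scheduler_end, master_end = socket.socketpair()
    manager = ToySocketManager()
    manager.link("scheduler(1)", scheduler_end)
    manager.exited("scheduler(1)", close_socket=close_socket)
    seen = peer_sees_disconnect(master_end)
    scheduler_end.close()
    master_end.close()
    return seen


if __name__ == "__main__":
    print("link removed, socket left open ->", demo(close_socket=False))
    print("link removed, socket closed    ->", demo(close_socket=True))
```

In this model the "master" end only notices the disconnect when the socket is 
actually closed, which mirrors why the real master keeps treating the 
framework as registered after stop(failover = true).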



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
