[
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anand Mazumdar updated MESOS-5180:
----------------------------------
Sprint: (was: Mesosphere Sprint 33)
> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
> Key: MESOS-5180
> URL: https://issues.apache.org/jira/browse/MESOS-5180
> Project: Mesos
> Issue Type: Bug
> Components: scheduler driver
> Affects Versions: 0.24.0
> Reporter: Joseph Wu
> Assignee: Anand Mazumdar
> Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with
> the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master
> changing. (Currently, the scheduler driver will only re-register if the
> master changes).
> If both links break or if just link (1) breaks, the master views the
> framework as {{inactive}} and {{disconnected}}. This means the framework
> will not receive any more events (such as offers) from the master until it
> re-registers. There is currently no way for the scheduler to detect a
> one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler. The
> scheduler usually uses the link to send messages to the master, but
> libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should
> implement a {{::exited}} event handler for the master's {{pid}} and trigger a
> master (re-)detection upon a disconnection. This in turn should make the
> driver (re-)register with the master. The scheduler library already does
> this:
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.
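
The proposed fix can be illustrated with a small toy model. This is only a sketch in Python, not the Mesos driver or the libprocess API: the class ToyDriver, its detect callback, the log format and the master pid string are all made up for illustration. It shows the intended control flow only: when the link to the current master exits, the driver marks itself disconnected, re-runs detection, and registers again even though the detected master has not changed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyDriver:
    """Toy stand-in for the scheduler driver (hypothetical, not the Mesos API)."""
    detect: Callable[[], str]            # hypothetical master detector, returns a pid
    master: Optional[str] = None         # pid of the master we last registered with
    connected: bool = False
    log: List[str] = field(default_factory=list)

    def start(self) -> None:
        # Initial detection followed by registration.
        self._detected(self.detect())

    def _detected(self, pid: str) -> None:
        # (Re-)register with whatever master detection returned,
        # even if it is the same master as before.
        self.master = pid
        self.connected = True
        self.log.append("register " + pid)

    def exited(self, pid: str) -> None:
        # Models an exited event for a linked pid.
        if pid != self.master:
            return  # ignore broken links to anything but the current master
        self.connected = False
        self.log.append("exited " + pid)
        # Trigger a master (re-)detection, which leads to re-registration.
        self._detected(self.detect())


MASTER = "master@10.0.0.1:5050"  # made-up pid for the example
driver = ToyDriver(detect=lambda: MASTER)
driver.start()
driver.exited(MASTER)  # link breaks, but the master itself has not changed
print(driver.log)
```

In this model the second "register" entry appears only because exited() forces a new detection; a driver that re-registers solely when the detected master changes would stay disconnected here.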