[ 
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5180:
-----------------------------
    Description: 
The existing implementation of the scheduler driver does not re-register with 
the master under some network partition cases.

When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master

It is possible for either of these links to break *without* the master 
changing.  (Currently, the scheduler driver will only re-register if the master 
changes).

If both links break or if just link (1) breaks, the master views the framework 
as {{inactive}} and {{disconnected}}.  This means the framework will not 
receive any more events (such as offers) from the master until it re-registers.
There is currently no way for the scheduler to detect a one-way link breakage.

If link (2) breaks, it makes (almost) no difference to the scheduler.  The 
scheduler usually uses the link to send messages to the master, but libprocess 
will create another socket if the persistent one is not available.

To fix link breakages for (1+2) and (2), the scheduler driver should implement 
a `::exited` event handler for the master's {{pid}} and re-register in this 
case.
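
As a rough illustration of the proposed fix, the sketch below models the 
control flow in plain C++ without libprocess.  Everything in it is 
hypothetical: {{ToySchedulerDriver}}, {{Pid}}, {{registerWithMaster}} and the 
counters are made-up stand-ins, not the real Mesos or libprocess classes.  It 
only shows the idea that an {{exited}} notification for the current master's 
{{pid}} triggers a re-registration even though the master has not changed.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a libprocess PID; only equality matters here.
using Pid = std::string;

// Toy model of a scheduler driver. These names are assumptions made for
// illustration and do not match the real Mesos scheduler driver.
class ToySchedulerDriver {
public:
  explicit ToySchedulerDriver(Pid master) : master_(std::move(master)) {}

  // Stands in for sending a (re-)registration message to the master.
  void registerWithMaster() {
    connected_ = true;
    ++registrations_;
  }

  // Models the `exited` callback: invoked when the link to `pid` breaks.
  void exited(const Pid& pid) {
    if (pid != master_) {
      return;  // Not the master this driver is linked to; ignore.
    }
    connected_ = false;
    // The master has not changed, but the link broke, so re-register.
    registerWithMaster();
  }

  bool connected() const { return connected_; }
  int registrations() const { return registrations_; }

private:
  Pid master_;
  bool connected_ = false;
  int registrations_ = 0;
};
```

In this model, calling {{exited}} with the master's {{pid}} after an initial 
registration raises the registration count from 1 to 2, while an {{exited}} 
for any other {{pid}} leaves it unchanged.  A real handler would also need to 
deal with retries and backoff, which the sketch leaves out.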

See the related issue MESOS-5181 for link (1) breakage.

  was:
The existing implementation of the scheduler driver does not re-register with 
the master under some network partition cases.

When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master

It is possible for either of these links to break *without* the master 
changing.  (Currently, the scheduler driver will only re-register if the master 
changes).

If both links break or if just link (1) breaks, the master views the framework 
as {{inactive}} and {{disconnected}}.  This means the framework will not 
receive any more events (such as offers) from the master until it re-registers.
There is currently no way for the scheduler to detect a one-way link breakage.

If link (2) breaks, it makes (almost) no difference to the scheduler.  The 
scheduler usually uses the link to send messages to the master, but libprocess 
will create another socket if the persistent one is not available.

To fix link breakages for (1+2) and (2), the scheduler driver should implement 
a `::exited` event handler for the master's {{pid}} and re-register in this 
case.

See the related issue [TODO] for link (1) breakage.


> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with 
> the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master 
> changing.  (Currently, the scheduler driver will only re-register if the 
> master changes).
> If both links break or if just link (1) breaks, the master views the 
> framework as {{inactive}} and {{disconnected}}.  This means the framework 
> will not receive any more events (such as offers) from the master until it 
> re-registers.  There is currently no way for the scheduler to detect a 
> one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The 
> scheduler usually uses the link to send messages to the master, but 
> libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should 
> implement a `::exited` event handler for the master's {{pid}} and re-register 
> in this case.
> See the related issue MESOS-5181 for link (1) breakage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
