[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043853#comment-14043853 ]
Tobias Weingartner commented on MESOS-1529:
-------------------------------------------
2) What does an "exit" event signify? Why would we need to check that it was
for a leading master?
3) How was the 75-second timeout determined? Does it lock us into a phased
upgrade path if the value ever needs to change? If we get a ping from a
non-leading master, we should probably ignore it rather than immediately
trigger re-registration, i.e. let the timeout take effect.
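
A minimal sketch of the behaviour suggested in point 3, assuming a slave-side watchdog that only counts pings from the master it currently believes is leading. The class, method names, and master identifiers are hypothetical and are not Mesos APIs; the 75-second default is simply the figure under discussion, exposed as a parameter so it is not hard-coded.

```python
import time

# Hypothetical default; 75 seconds is the value questioned in the comment.
PING_TIMEOUT_SECS = 75.0


class MasterPingWatchdog:
    """Slave-side sketch: only pings from the leading master reset the timer."""

    def __init__(self, leader, timeout=PING_TIMEOUT_SECS, clock=time.monotonic):
        self.leader = leader          # identifier of the believed leading master
        self.timeout = timeout        # seconds of silence before re-registering
        self.clock = clock            # injectable clock, useful for testing
        self.last_ping = clock()

    def on_ping(self, sender):
        """Record a ping; return True only if it came from the leader."""
        if sender != self.leader:
            # Ignore non-leading masters: no timer reset, no re-registration.
            return False
        self.last_ping = self.clock()
        return True

    def on_new_leader(self, leader):
        """A newly detected leader restarts the timeout window."""
        self.leader = leader
        self.last_ping = self.clock()

    def should_reregister(self):
        """True once the leader has been silent for the full timeout."""
        return self.clock() - self.last_ping >= self.timeout
```

In this sketch a stray ping from a non-leading master changes nothing, so re-registration is driven by the timeout alone.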
> Handle a network partition between Master and Slave
> ---------------------------------------------------
>
> Key: MESOS-1529
> URL: https://issues.apache.org/jira/browse/MESOS-1529
> Project: Mesos
> Issue Type: Bug
> Reporter: Dominic Hamon
>
> If a network partition occurs between a Master and Slave, the Master will
> remove the Slave (as it fails health check) and mark the tasks being run
> there as LOST. However, the Slave is not aware that it has been removed so
> the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives an
> 'exited' event, which indicates that the connection between the master and
> slave is not closed.)
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent
> view of a network partition. We may still see this issue should a one-way
> connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the
> Slave reappears and reconcile then. We'd still need to mark Slaves and tasks
> as potentially lost (zombie state) but maybe the Scheduler can make a more
> intelligent decision.