Neil Conway created MESOS-6676:
----------------------------------
Summary: Always re-link with scheduler during re-registration
Key: MESOS-6676
URL: https://issues.apache.org/jira/browse/MESOS-6676
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Neil Conway
Assignee: Neil Conway
Scenario:
# Framework registers with master using a non-zero {{failover_timeout}} and is
assigned a FrameworkID.
# The master sees an {{ExitedEvent}} for the master -> scheduler link. This could
happen due to some transient network error, e.g., a one-way partition. The master
sends a {{FrameworkErrorMessage}} to the framework. The master marks the
framework as disconnected, but keeps the {{Framework*}} for it around in
{{frameworks.registered}}.
# The framework doesn't receive the {{FrameworkErrorMessage}} because it is
dropped by the network.
# The scheduler might receive an {{ExitedEvent}} for the scheduler -> master
link, but it ignores this anyway (see MESOS-887).
# The scheduler sees a new-master-detected event and re-registers with the
master. It does _not_ set the {{force}} flag. This means we follow [this code
path|https://github.com/apache/mesos/blob/a6bab9015cd63121081495b8291635f386b95a92/src/master/master.cpp#L2771]
in the master, which does _not_ relink with the scheduler.
The result is that scheduler re-registration succeeds, but the master ->
scheduler link is never re-established.
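
The scenario above can be replayed against a small toy model. This is only a sketch and not the Mesos master code: {{ToyMaster}}, {{FrameworkState}}, {{replayScenario}} and the {{alwaysRelink}} switch are hypothetical names made up for illustration, and the model assumes that the existing code path relinks only when {{force}} is set.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy model only: these types are hypothetical and are not Mesos classes.
struct FrameworkState {
  bool connected = false;  // master's bookkeeping for the framework
  bool linked = false;     // whether the master -> scheduler link is up
};

class ToyMaster {
public:
  // `alwaysRelink` selects the behavior proposed in the summary.
  explicit ToyMaster(bool alwaysRelink) : alwaysRelink_(alwaysRelink) {}

  // Initial registration: the master links to the scheduler.
  void registerFramework(const std::string& id) {
    FrameworkState& f = frameworks_[id];
    f.connected = true;
    f.linked = true;
  }

  // The link breaks: the entry is kept, but marked disconnected.
  void exited(const std::string& id) {
    FrameworkState& f = frameworks_.at(id);
    f.connected = false;
    f.linked = false;
  }

  // Re-registration of a framework the master still remembers.
  // Assumption of this sketch: without `alwaysRelink`, only a forced
  // re-registration re-establishes the link.
  void reregisterFramework(const std::string& id, bool force) {
    FrameworkState& f = frameworks_.at(id);
    f.connected = true;
    if (force || alwaysRelink_) {
      f.linked = true;
    }
  }

  const FrameworkState& state(const std::string& id) const {
    return frameworks_.at(id);
  }

private:
  bool alwaysRelink_;
  std::unordered_map<std::string, FrameworkState> frameworks_;
};

// Replays the scenario and returns the master's final view of the framework.
FrameworkState replayScenario(bool alwaysRelink) {
  ToyMaster master(alwaysRelink);
  const std::string id = "framework-1";  // made-up FrameworkID

  master.registerFramework(id);
  assert(master.state(id).linked);

  master.exited(id);  // the FrameworkErrorMessage is lost in transit
  master.reregisterFramework(id, /*force=*/false);
  return master.state(id);
}
```

With {{alwaysRelink}} set to false, the final state is connected but not linked, which matches the symptom described here. With it set to true, the link is restored regardless of {{force}}.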