[
https://issues.apache.org/jira/browse/MESOS-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002529#comment-14002529
]
Dominic Hamon commented on MESOS-1374:
--------------------------------------
One case is problematic: scheduler failover. From
{{master.cpp}}:
{code}
// Replace the scheduler for a framework with a new process ID, in the
// event of a scheduler failover.
void Master::failoverFramework(Framework* framework, const UPID& newPid)
{
  const UPID& oldPid = framework->pid;

  // There are a few failover cases to consider:
  //   1. The pid has changed. In this case we definitely want to
  //      send a FrameworkErrorMessage to shut down the older
  //      scheduler.
  //   2. The pid has not changed.
  //      2.1 The old scheduler on that pid failed over to a new
  //          instance on the same pid. No need to shut down the old
  //          scheduler as it is necessarily dead.
  //      2.2 This is a duplicate message. In this case, the scheduler
  //          has not failed over, so we do not want to shut it down.
  if (oldPid != newPid) {
    FrameworkErrorMessage message;
    message.set_message("Framework failed over");
    send(oldPid, message);
  }
{code}
If the port doesn't change, the pid will be the same (case 2.1 above).
However, an 'exited' message from the old Framework may still be enqueued
at that point. If so, we (correctly) won't shut down the Framework here,
but we will then incorrectly deactivate it once the master processes the
stale 'exited' message.
See also this comment from elsewhere in {{master.cpp}}:
{code}
// We do not attempt to detect a duplicate re-registration
// message here because it is impossible to distinguish between
// a duplicate message, and a scheduler failover to the same
// pid, given the existing libprocess primitives (PID does not
// identify the libprocess Process instance).
{code}
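To make the ordering problem concrete, here is a minimal toy model of the race described above. It is only a sketch: {{Pid}}, {{ToyMaster}} and {{staleExitedDeactivates}} are hypothetical names invented for illustration and are not Mesos or libprocess types. The one assumption carried over from the real code is that a pid compares equal whenever the id and ip:port match.

```cpp
#include <string>

// Hypothetical stand-in for a libprocess UPID: its identity is only
// id@ip:port, so two Process instances can share the same value.
struct Pid {
  std::string id;
  std::string address;  // "ip:port"

  bool operator==(const Pid& other) const {
    return id == other.id && address == other.address;
  }
  bool operator!=(const Pid& other) const { return !(*this == other); }
};

// Toy master tracking a single framework (illustrative only).
struct ToyMaster {
  Pid frameworkPid;
  bool active;
  bool sentShutdown;

  explicit ToyMaster(const Pid& pid)
    : frameworkPid(pid), active(true), sentShutdown(false) {}

  // Mirrors the check quoted from failoverFramework: the old scheduler
  // is only told to shut down if the pid changed.
  void failover(const Pid& newPid) {
    if (frameworkPid != newPid) {
      sentShutdown = true;
    }
    frameworkPid = newPid;
    active = true;
  }

  // Handles an 'exited' event: deactivates the framework if the pid
  // matches, with no way to tell which instance the event came from.
  void exited(const Pid& pid) {
    if (pid == frameworkPid) {
      active = false;
    }
  }
};

// Replays the problematic ordering: the re-registration is handled
// first, then the stale 'exited' event from the dead instance.
bool staleExitedDeactivates(const Pid& oldPid, const Pid& newPid) {
  ToyMaster master(oldPid);
  master.failover(newPid);
  master.exited(oldPid);
  return !master.active;
}
```

In this model, failing over to an identical pid leaves the new scheduler deactivated, while any difference in the pid makes the stale event harmless.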
The solution is to ensure that the pid is unique even when the port is
static, which means changing the {{SchedulerProcess}} id to include a UUID.
This doesn't affect access to the metrics endpoint, as that is exposed
from the {{MetricsProcess}} and will still be available at
{{<ip>:<port>/metrics/snapshot}}.
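As a rough sketch of the idea, the snippet below builds a per-instance id by appending a random suffix to a fixed prefix. The helpers {{uniqueProcessId}}, {{makePid}} and {{metricsUrl}} are hypothetical and are not the libprocess API. The 128-bit random hex suffix is only a stand-in for a real UUID, and the "scheduler" prefix is an assumption.

```cpp
#include <cstdint>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>

// Hypothetical helper: returns prefix + "-" + 32 random hex digits.
// The suffix stands in for a UUID; real code would use a UUID library.
std::string uniqueProcessId(const std::string& prefix) {
  // One generator shared across calls so successive ids differ.
  static std::mt19937_64 gen(std::random_device{}());
  std::uniform_int_distribution<std::uint64_t> dist;

  std::ostringstream out;
  out << prefix << '-' << std::hex << std::setfill('0')
      << std::setw(16) << dist(gen)
      << std::setw(16) << dist(gen);
  return out.str();
}

// Builds the "id@ip:port" string form used here to compare pids.
std::string makePid(const std::string& id, const std::string& address) {
  return id + "@" + address;
}

// The metrics URL depends only on the address, not on the scheduler id,
// so a changing id does not move the endpoint.
std::string metricsUrl(const std::string& address) {
  return address + "/metrics/snapshot";
}
```

With ids built this way, two scheduler instances on the same static port get different pids, so an 'exited' event for the old instance no longer matches the new one.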
> Verify static libprocess scheduler port works with Mesos Master
> ---------------------------------------------------------------
>
> Key: MESOS-1374
> URL: https://issues.apache.org/jira/browse/MESOS-1374
> Project: Mesos
> Issue Type: Task
> Components: framework, master
> Reporter: Chris Lambert
> Assignee: Dominic Hamon
> Labels: 5
>