[ 
https://issues.apache.org/jira/browse/MESOS-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002529#comment-14002529
 ] 

Dominic Hamon commented on MESOS-1374:
--------------------------------------

One case is problematic: scheduler failover. From 
{{master.cpp}}:
{code}
// Replace the scheduler for a framework with a new process ID, in the
// event of a scheduler failover.
void Master::failoverFramework(Framework* framework, const UPID& newPid)
{
  const UPID& oldPid = framework->pid;

  // There are a few failover cases to consider:
  //   1. The pid has changed. In this case we definitely want to
  //      send a FrameworkErrorMessage to shut down the older
  //      scheduler.
  //   2. The pid has not changed.
  //      2.1 The old scheduler on that pid failed over to a new
  //          instance on the same pid. No need to shut down the old
  //          scheduler as it is necessarily dead.
  //      2.2 This is a duplicate message. In this case, the scheduler
  //          has not failed over, so we do not want to shut it down.
  if (oldPid != newPid) {
    FrameworkErrorMessage message;
    message.set_message("Framework failed over");
    send(oldPid, message);
  }
{code}

If the port doesn't change, the pid will be the same (case 2.1 above). 
However, there is a chance that an 'exited' message from the old Framework is 
still enqueued. If that happens, we (correctly) won't shut down the Framework 
here, but we will then incorrectly deactivate the Framework when we get to 
exited.
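
To make the ordering concrete, here is a minimal, self-contained sketch of the race. It is a toy model, not Mesos code: {{ToyMaster}}, its fields, and {{activeAfterStaleExit}} are hypothetical names, a pid is modelled as a plain string, and the 'exited' handler is assumed to deactivate the framework whenever the exited pid matches the registered one.

```cpp
#include <string>

// Toy stand-in for a libprocess UPID: just an "id@ip:port" string.
// (Assumption for illustration; the real UPID is a richer type.)
using Pid = std::string;

// Hypothetical, heavily simplified model of the master's view of one framework.
struct ToyMaster {
  Pid frameworkPid;           // pid currently registered for the framework
  bool active = true;         // whether the framework is considered active
  bool sentShutdown = false;  // whether a shutdown was sent to the old pid

  // Mirrors the branch in the excerpt: only shut down the old scheduler
  // when the pid has changed.
  void failover(const Pid& newPid) {
    if (frameworkPid != newPid) {
      sentShutdown = true;
    }
    frameworkPid = newPid;
    active = true;
  }

  // Assumed 'exited' handling: deactivate the framework if the exited pid
  // matches the registered pid. The pid alone cannot tell the old process
  // instance from the new one.
  void exited(const Pid& pid) {
    if (pid == frameworkPid) {
      active = false;
    }
  }
};

// Replays the problematic ordering: the new scheduler re-registers first,
// then the stale 'exited' for the old scheduler is delivered.
// Returns whether the framework is still active afterwards.
bool activeAfterStaleExit(const Pid& oldPid, const Pid& newPid) {
  ToyMaster master;
  master.frameworkPid = oldPid;
  master.failover(newPid);
  master.exited(oldPid);
  return master.active;
}
```

In this model, calling {{activeAfterStaleExit}} with identical old and new pids returns false (the freshly failed-over framework is deactivated), while distinct pids leave it active because the stale 'exited' no longer matches.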

See also this comment from elsewhere in master.cpp:

{code}
      // We do not attempt to detect a duplicate re-registration
      // message here because it is impossible to distinguish between
      // a duplicate message, and a scheduler failover to the same
      // pid, given the existing libprocess primitives (PID does not
      // identify the libprocess Process instance).
{code}

The solution is to ensure that the pid is unique even when the port is static, 
which means changing the {{SchedulerProcess}} id to include a UUID. This does 
not affect access to the metrics endpoint, which is exposed by the 
{{MetricsProcess}} and will be at {{<ip>:<port>/metrics/snapshot}}.
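
As a rough sketch of the idea (not the actual patch), the process id could be built from a fixed prefix plus a random suffix, so that two scheduler instances on the same ip and port still end up with different pids. Everything below is an assumption for illustration: {{uniqueProcessId}} and {{makePid}} are hypothetical helpers, the pid is modelled as an "id@ip:port" string, and 128 random bits in hex stand in for a real UUID.

```cpp
#include <cstdint>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>

// Hypothetical helper: returns "<prefix>-<32 hex digits>".
// The 128 random bits are a stand-in for a proper UUID; a real
// implementation would presumably use a UUID library instead.
std::string uniqueProcessId(const std::string& prefix) {
  static std::mt19937_64 rng{std::random_device{}()};
  std::ostringstream out;
  out << prefix << '-' << std::hex << std::setfill('0')
      << std::setw(16) << rng()
      << std::setw(16) << rng();
  return out.str();
}

// Toy pid formatting, "id@ip:port". Only the id part differs between
// scheduler instances when the port is static.
std::string makePid(const std::string& id,
                    const std::string& ip,
                    std::uint16_t port) {
  return id + "@" + ip + ":" + std::to_string(port);
}
```

With ids generated this way, an old and a new scheduler bound to the same static port compare as different pids, so the failover falls under case 1 of the excerpt and a late 'exited' for the old pid no longer matches the registered one.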


> Verify static libprocess scheduler port works with Mesos Master
> ---------------------------------------------------------------
>
>                 Key: MESOS-1374
>                 URL: https://issues.apache.org/jira/browse/MESOS-1374
>             Project: Mesos
>          Issue Type: Task
>          Components: framework, master
>            Reporter: Chris Lambert
>            Assignee: Dominic Hamon
>              Labels: 5
>




