> On June 9, 2016, 9:16 a.m., Neil Conway wrote:
> > src/sched/sched.cpp, line 1001
> > <https://reviews.apache.org/r/48453/diff/1/?file=1411804#file1411804line1001>
> >
> >     What happens in the following scenario:
> >     
> >     * framework launches task with executor (=> add UPID to `taskPids`)
> >     * agent where the task is running fails health checks (=> framework 
> > receives `TASK_LOST`, which is considered a terminal state per 
> > `isTerminalState()`, so we remove the UPID from `taskPids`)
> >     * master fails over and we reregister with a new master
> >     * agent reregisters with the master; this is allowed, per non-strict 
> > registry
> >     * we get `TASK_RUNNING` for the task
> >     
> >     ISTM we won't track the executor in `executorPids`, although we should.
> >     
> >     In general, the logic here seems pretty complicated and a little 
> > arbitrary...

That is right. The effect would be that a `FrameworkMessage` sent to the 
executor is routed through the master (instead of directly to the 
executor).

One option to handle this case would be to add the slave PID to the 
`TaskStatusUpdate` message. If we do so:
- We add `executorPids[executorId][slaveId] = SlavePID` on `TASK_RUNNING` 
only (if that is the first task for this executor ID running on this slave), 
i.e. we track UPIDs only when we receive `TASK_RUNNING` (so no need for 
`taskPids`).
- In `lostExecutor()`:
  - We clean up `executorPids` for the lost executor, i.e. 
`executorPids[executorId].erase(slaveId)`
  - Also:
  `if (executorPids[executorId].empty()) {
     executorPids.erase(executorId);
   }`
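
The bookkeeping in the two bullets above could look roughly like the sketch 
below. It is only an illustration, not the patch: plain strings stand in for 
`ExecutorID`, `SlaveID` and `UPID`, the class name `ExecutorPidTracker` and its 
method names are hypothetical, and it assumes the status update already 
carries the slave PID.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Assumption: plain strings stand in for the Mesos ID types and
// process::UPID so that the sketch is self-contained.
using ExecutorID = std::string;
using SlaveID = std::string;
using UPID = std::string;

// Hypothetical helper; in the driver this state would live in sched.cpp.
class ExecutorPidTracker
{
public:
  // Called when TASK_RUNNING arrives. Assumes the status update
  // carries the PID of the slave that runs the executor.
  void taskRunning(
      const ExecutorID& executorId,
      const SlaveID& slaveId,
      const UPID& slavePid)
  {
    assert(!slavePid.empty());

    // emplace() leaves an existing entry untouched, so only the first
    // task of this executor on this slave records the PID.
    executorPids[executorId].emplace(slaveId, slavePid);
  }

  // Mirrors the cleanup proposed for lostExecutor().
  void lostExecutor(const ExecutorID& executorId, const SlaveID& slaveId)
  {
    auto it = executorPids.find(executorId);
    if (it == executorPids.end()) {
      return;
    }

    it->second.erase(slaveId);

    // Drop the outer entry once no slave runs this executor any more.
    if (it->second.empty()) {
      executorPids.erase(it);
    }
  }

  // Returns nullptr when the pair is not tracked; the caller would then
  // route the FrameworkMessage through the master.
  const UPID* find(const ExecutorID& executorId, const SlaveID& slaveId) const
  {
    auto it = executorPids.find(executorId);
    if (it == executorPids.end()) {
      return nullptr;
    }

    auto pid = it->second.find(slaveId);
    return pid == it->second.end() ? nullptr : &pid->second;
  }

  std::size_t size() const { return executorPids.size(); }

private:
  std::map<ExecutorID, std::map<SlaveID, UPID>> executorPids;
};
```

With this shape, a `TASK_RUNNING` that arrives after a `TASK_LOST` simply 
re-adds the entry, which covers the scenario described above.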

We won't need `savedSlavePids` anymore since we track PIDs based on the tuple 
`<ExecutorID, SlaveID>`.

IMO, this makes the interface cleaner and less complicated, but it involves 
adding the PID of the slave to `StatusUpdateMessage` or to `StatusUpdate`. 
What do you think?


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48453/#review136777
-----------------------------------------------------------


On June 9, 2016, 1:08 a.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48453/
> -----------------------------------------------------------
> 
> (Updated June 9, 2016, 1:08 a.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5143
>     https://issues.apache.org/jira/browse/MESOS-5143
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Since UPIDs are tracked in the scheduler driver so that it can send
> FrameworkMessage directly to an executor, we now track UPIDs per
> executor running on an agent (instead of per slave). We track this
> mapping only for the life of the executor (instead of the life of the
> agent). This lets us send the lost slave message to the relevant
> frameworks only (instead of to all frameworks).
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 9f561d73a2e591afdc3ba4adb35a11763dced402 
> 
> Diff: https://reviews.apache.org/r/48453/diff/
> 
> 
> Testing
> -------
> 
> All tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>
