> On June 9, 2016, 9:16 a.m., Neil Conway wrote:
> > src/sched/sched.cpp, line 1001
> > <https://reviews.apache.org/r/48453/diff/1/?file=1411804#file1411804line1001>
> >
> > What happens in the following scenario:
> >
> > * framework launches task with executor (=> add UPID to `taskPids`)
> > * agent where the task is running fails health checks (=> framework
> > receives `TASK_LOST`, which is considered a terminal state per
> > `isTerminalState()`, so we remove the UPID from `taskPids`)
> > * master fails over and we reregister with a new master
> > * agent reregisters with the master; this is allowed, per non-strict
> > registry
> > * we get `TASK_RUNNING` for the task
> >
> > ISTM we won't track the executor in `executorPids`, although we should.
> >
> > In general, the logic here seems pretty complicated and a little
> > arbitrary...
That is right. The effect would be that a `FrameworkMessage` to the
executor is routed through the master (instead of being sent directly to
the executor).
One option to handle this case is to add the slave PID to the
`TaskStatusUpdate` message. If we do so:
- We add `executorPids[executorId][slaveId] = SlavePID` on TASK_RUNNING
  only (if that is the first task for this executor ID running on this
  slave), i.e. we track UPIDs only when we receive TASK_RUNNING (so there is
  no need for `taskPids`).
- In `lostExecutor()`:
  - We clean up `executorPids` for the lost executor, i.e.
    `executorPids[executorId].erase(slaveId)`
  - Also:
    `if (executorPids[executorId].size() == 0) {
       executorPids.erase(executorId);
     }`
We won't need `savedSlavePids` anymore since we track PIDs based on the tuple
`<ExecutorID, SlaveID>`.
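
To make the proposal concrete, here is a rough, standalone sketch of the
bookkeeping described above. It is not the actual `sched.cpp` code: the class
name `ExecutorPidTracker`, the `lookup()` and `executorCount()` helpers, and the
use of `std::string` in place of the real `ExecutorID`, `SlaveID` and `UPID`
types are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Stand-ins for the Mesos types (assumption: plain strings, for illustration).
using ExecutorID = std::string;
using SlaveID = std::string;
using UPID = std::string;

// Hypothetical helper that tracks slave PIDs per <ExecutorID, SlaveID>.
class ExecutorPidTracker
{
public:
  // Called when TASK_RUNNING arrives carrying the slave PID. Only the first
  // task for this executor on this slave records the PID; later calls for
  // the same pair leave the existing entry untouched.
  void onTaskRunning(
      const ExecutorID& executorId,
      const SlaveID& slaveId,
      const UPID& slavePid)
  {
    executorPids[executorId].emplace(slaveId, slavePid);
  }

  // Mirrors the lostExecutor() cleanup: drop the entry for this slave, and
  // drop the executor's inner map once it becomes empty.
  void lostExecutor(const ExecutorID& executorId, const SlaveID& slaveId)
  {
    auto it = executorPids.find(executorId);
    if (it == executorPids.end()) {
      return;
    }

    it->second.erase(slaveId);
    if (it->second.empty()) {
      executorPids.erase(it);
    }
  }

  // Returns the tracked PID, or nullptr if none is known (in which case a
  // FrameworkMessage would have to be routed through the master).
  const UPID* lookup(
      const ExecutorID& executorId,
      const SlaveID& slaveId) const
  {
    auto it = executorPids.find(executorId);
    if (it == executorPids.end()) {
      return nullptr;
    }

    auto pid = it->second.find(slaveId);
    return pid == it->second.end() ? nullptr : &pid->second;
  }

  // Number of executors that currently have at least one tracked PID.
  std::size_t executorCount() const { return executorPids.size(); }

private:
  std::map<ExecutorID, std::map<SlaveID, UPID>> executorPids;
};
```

Using `find()` in `lostExecutor()` rather than `operator[]` avoids creating an
empty inner map for an executor we never tracked.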
IMO, it makes the interface cleaner and less complicated, but it involves
adding the PID of the slave in `StatusUpdateMessage` or in `StatusUpdate`. What
do you think?
- Anindya
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48453/#review136777
-----------------------------------------------------------
On June 9, 2016, 1:08 a.m., Anindya Sinha wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48453/
> -----------------------------------------------------------
>
> (Updated June 9, 2016, 1:08 a.m.)
>
>
> Review request for mesos and Jiang Yan Xu.
>
>
> Bugs: MESOS-5143
> https://issues.apache.org/jira/browse/MESOS-5143
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Since UPIDs are tracked in the scheduler driver so that it can send a
> FrameworkMessage directly to an executor, we now track UPIDs per executor
> running on an agent (instead of per slave). We track this mapping only
> for the life of the executor (instead of the life of the agent). This
> enables us to send the lost slave message to the relevant frameworks
> only (instead of to all frameworks).
>
>
> Diffs
> -----
>
> src/sched/sched.cpp 9f561d73a2e591afdc3ba4adb35a11763dced402
>
> Diff: https://reviews.apache.org/r/48453/diff/
>
>
> Testing
> -------
>
> All tests passed.
>
>
> Thanks,
>
> Anindya Sinha
>
>