> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039>
> >
> >     I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> >     
> >     
> >     Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> >     
> >     But doing this is tricky for various reasons:
> >     
> >     --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> >     
> >     --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> >     
> >     --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> >     
> >     I think the direction we want to go towards is
> >     
> >     --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> >     --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> >     
> >     
> >     Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
>     Ignore the first line. Forgot to delete it.
> 
> Jiang Yan Xu wrote:
>     Future direction
>     
>     1. If completed == terminal unacknowledged + terminal acknowledged, then 
> completed == terminal right? Should we then unify the terminology and pick 
> one?
>     2. Unreachable tasks == non-terminal tasks on an unreachable agent: this 
> is what this RR is going to do but IIUC you want a different behavior.
>     
>     Current semantics
>     
>     1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
> operator API subsribers for terminal tasks), as it stands right now we are 
> going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
> that?
>     2. You also said above that "Ideally, we want terminal but unacknowledged 
> tasks to still be marked unreachable in some way" which seems to contradict 
> your later point that "Unreachable tasks should consist of all non-complete 
> (terminal) tasks", could you clarify?
>     
>     Overall it sounds to me that the most correct semantic is still to set 
> `TASK_UNREACHABLE` only for the tasks that are non-terminal (because 
> otherwise we know that the state is not going to change to something else 
> that we don't know yet) but perhaps we can use another field in the status to 
> signal the fact that the agent is partitioned?

https://issues.apache.org/jira/browse/MESOS-8405


- James


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
-----------------------------------------------------------


On Jan. 5, 2018, 7:06 p.m., James Peach wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------------------------------------------------------
> 
> (Updated Jan. 5, 2018, 7:06 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
>     https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/4/
> 
> 
> Testing
> -------
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Reply via email to