> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039>
> >
> >     I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> >     
> >     
> >     Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> >     
> >     But doing this is tricky for various reasons:
> >     
> >     --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> >     
> >     --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> >     
> >     --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> >     
> >     I think the direction we want to go towards is
> >     
> >     --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> >     --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> >     
> >     
> >     Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
>     Ignore the first line. Forgot to delete it.
> 
> Jiang Yan Xu wrote:
>     Future direction
>     
>     1. If completed == terminal unacknowledged + terminal acknowledged, then 
> completed == terminal right? Should we then unify the terminology and pick 
> one?
>     2. Unreachable tasks == non-terminal tasks on an unreachable agent: this 
> is what this RR is going to do but IIUC you want a different behavior.
>     
>     Current semantics
>     
>     1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
> operator API subsribers for terminal tasks), as it stands right now we are 
> going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
> that?
>     2. You also said above that "Ideally, we want terminal but unacknowledged 
> tasks to still be marked unreachable in some way" which seems to contradict 
> your later point that "Unreachable tasks should consist of all non-complete 
> (terminal) tasks", could you clarify?
>     
>     Overall it sounds to me that the most correct semantic is still to set 
> `TASK_UNREACHABLE` only for the tasks that are non-terminal (because 
> otherwise we know that the state is not going to change to something else 
> that we don't know yet) but perhaps we can use another field in the status to 
> signal the fact that the agent is partitioned?
> 
> James Peach wrote:
>     https://issues.apache.org/jira/browse/MESOS-8405
> 
> Vinod Kone wrote:
>     Sorry for the confusion. In the future direction section it should be 
> "Unreachable tasks should consist of all tasks on an unreachable agent.". 
> Note that this is what is returned when a framework does explicit 
> reconciliation for example. I guess this means a terminal task can go to 
> UNREACHABLE and back depending on the state of the agent, which I agree is 
> also a bit weird. Basically the TASK_UNREACHABLE state is being used as a 
> proxy for the agent state, it's not really task state. I think that's source 
> of the confusion. In retrospect, maybe the unreachability of an agent 
> should've been indicated to a framework via a different signal (say another 
> field in the status update) than task state. 
>     
>     
>     As an aside, I think we want to send TASK_UNREACHABLE updates for 
> unreachable tasks during implicit reconciliation; I think we overlooked 
> sending them.

> unreachability of an agent should've been indicated to a framework via a 
> different signal (say another field in the status update) than task state. 

Agreed. I was suggesting something similar above as well.

So, coming back to the behavior we want for 1.5 for unacknowledged terminal 
tasks when the agent is partitioned:

- Scheduler and operator subscribers are sent TASK_UNREACHABLE updates (we are 
doing this right now but we need to clarify the differences: I see Vinod 
already pointed this out).
- **I missed this earlier**: Unacknowledged terminal tasks need to be put in 
`framework.unreachableTasks` so they are not put in `framework.completedTasks` 
and are removed from this map when the agent comes back. The current version of 
this RR (and the 1.4 version) would've put these tasks into 
`framework.completedTasks` when parititioning happens and they are not removed 
from `framework.completedTasks` when the agent comes back while also being 
added back to `framework.tasks` (duplication occurs). So I think we probably 
need to maintain the behavior of the current master version (terminal but 
unacknowlwedged tasks are put to `framework.unreachableTasks` with the state 
`TASK_UNREACHABLE`) but remove the checks that fail.
- This way, during parition whe webUI would show these terminal tasks in 
`TASK_UNREACHABLE` state and the operator API calls (not subscribers) would get 
the tasks in the `unreachable_tasks` section (otherwise they would've showed up 
in the `completed_tasks` section which would seem worse in the current 
semantics.

Does it make sense?


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
-----------------------------------------------------------


On Jan. 5, 2018, 11:37 a.m., James Peach wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------------------------------------------------------
> 
> (Updated Jan. 5, 2018, 11:37 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
>     https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/5/
> 
> 
> Testing
> -------
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Reply via email to