> On Jan. 5, 2018, 1:25 a.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039>
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call
> > `updateTask` or `forward` for a terminal task at all. Also, `forward`
> > sends a TASK_UNREACHABLE update for the terminal task to the framework,
> > which looks incorrect.
> >
> >
> > Ideally, we want terminal but unacknowledged tasks to still be marked
> > unreachable in some way, either via task state being TASK_UNREACHABLE or
> > task being present in `unreachableTasks`. This allows, for example, the
> > WebUI to not show sandbox links for unreachable tasks irrespective of
> > whether they were terminal or not before going unreachable.
> >
> > But doing this is tricky for various reasons:
> >
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task,
> > it adds a TASK_UNREACHABLE status to `Task.statuses` and also sends it to
> > operator API stream subscribers, which looks incorrect. The fact that
> > `updateTask` internally deals with already terminal tasks is a bad design
> > decision in retrospect. I think the callers should instead not call it
> > for terminal tasks.
> >
> > --> It's not clear to our users what a `completed` task means. The
> > intention was for this to hold a cache of terminal and acknowledged tasks
> > for storing recent history. The users of the WebUI probably equate
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement
> > status, which is why it is confusing for them to see terminal but
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> >
> > --> When a framework reconciles the state of a task on an unreachable
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the
> > task was in a non-terminal state, a terminal but unacknowledged state, or
> > a terminal and acknowledged state when the agent went unreachable.
> >
> > I think the direction we want to go towards is
> >
> > --> Completed tasks should consist of terminal unacknowledged and
> > terminal acknowledged tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an
> > unreachable agent. All the tasks in this map should be in TASK_UNREACHABLE
> > state.
> >
> >
> > Given all the above is a very involved change, I would recommend
> > keeping what you have here but with a giant TODO (your current comment in
> > #10058 doesn't go into enough detail about the complexity here) that talks
> > about the above stuff. Your change at least keeps the parity with the
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
>
> Vinod Kone wrote:
>     Ignore the first line. Forgot to delete it.
>
> Jiang Yan Xu wrote:
>     Future direction
>
>     1. If completed == terminal unacknowledged + terminal acknowledged, then
>     completed == terminal, right? Should we then unify the terminology and
>     pick one?
>     2. Unreachable tasks == non-terminal tasks on an unreachable agent: this
>     is what this RR is going to do but IIUC you want a different behavior.
>
>     Current semantics
>
>     1. In 1.4 the master (in `updateTask`) sends `TASK_UNREACHABLE` to the
>     operator API subscribers for terminal tasks; as it stands right now we
>     are going to send `TASK_UNREACHABLE` to the schedulers as well. Should
>     we change that?
>     2. You also said above that "Ideally, we want terminal but unacknowledged
>     tasks to still be marked unreachable in some way", which seems to
>     contradict your later point that "Unreachable tasks should consist of all
>     non-complete (terminal) tasks". Could you clarify?
>
>     Overall it sounds to me that the most correct semantic is still to set
>     `TASK_UNREACHABLE` only for the tasks that are non-terminal (because
>     otherwise we know that the state is not going to change to something else
>     that we don't know yet) but perhaps we can use another field in the
>     status to signal the fact that the agent is partitioned?
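
The semantic suggested in the quoted comment above (set `TASK_UNREACHABLE` only for non-terminal tasks, and leave terminal tasks out of the unreachable list) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `src/master/master.cpp`: `TaskState`, `Task`, `Framework`, `isTerminal` and `markAgentUnreachable` here are simplified, hypothetical stand-ins, and acknowledgement tracking is left out entirely.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real Mesos definitions.
enum class TaskState {
  TASK_RUNNING,
  TASK_FINISHED,
  TASK_FAILED,
  TASK_KILLED,
  TASK_UNREACHABLE
};

struct Task {
  std::string id;
  TaskState state;
};

struct Framework {
  // Tasks considered unreachable, keyed by task ID (assumed layout).
  std::map<std::string, Task> unreachableTasks;
};

// Assumed helper: a terminal task can no longer change state.
bool isTerminal(TaskState state) {
  return state == TaskState::TASK_FINISHED ||
         state == TaskState::TASK_FAILED ||
         state == TaskState::TASK_KILLED;
}

// Handles the tasks of an agent that has become unreachable. Returns the
// IDs of the tasks for which a TASK_UNREACHABLE update would be generated.
std::vector<std::string> markAgentUnreachable(
    Framework& framework,
    const std::vector<Task>& tasks) {
  std::vector<std::string> updated;

  for (Task task : tasks) {
    if (isTerminal(task.state)) {
      // Already terminal: the task was not really lost, so it keeps its
      // final state and is not tracked as unreachable.
      continue;
    }

    task.state = TaskState::TASK_UNREACHABLE;
    framework.unreachableTasks.emplace(task.id, task);
    updated.push_back(task.id);
  }

  return updated;
}
```

In this sketch a running task ends up in `unreachableTasks` with state `TASK_UNREACHABLE`, while a finished task is skipped. Signalling the partition for terminal tasks (for example through another status field, as asked above) is deliberately not modelled.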
https://issues.apache.org/jira/browse/MESOS-8405


- James


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
-----------------------------------------------------------


On Jan. 5, 2018, 7:06 p.m., James Peach wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------------------------------------------------------
>
> (Updated Jan. 5, 2018, 7:06 p.m.)
>
>
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod
> Kone, and Jiang Yan Xu.
>
>
> Bugs: MESOS-8337
>     https://issues.apache.org/jira/browse/MESOS-8337
>
>
> Repository: mesos
>
>
> Description
> -------
>
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
>
>
> Diffs
> -----
>
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a
>
>
> Diff: https://reviews.apache.org/r/64940/diff/4/
>
>
> Testing
> -------
>
> make check (Fedora 27)
>
>
> Thanks,
>
> James Peach
>
>