> On June 4, 2015, 10:58 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 4400-4413
> > <https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400>
> >
> >     This is getting a little hairy. As the TODO says, we really ought to
> >     bubble this up via the Termination protobuf. Have you looked into it?
> 
> Jie Yu wrote:
>     +1
>     
>     It really gets nasty here. I remember that jaybuff said they plan to fix
>     that. Maybe you want to sync with them first.
> 
> Niklas Nielsen wrote:
>     Mind adding a few references? Where is the Termination proto?
> 
> Jie Yu wrote:
>     Some references for context:
>     https://issues.apache.org/jira/browse/MESOS-2035
>     https://reviews.apache.org/r/33249/ (see discussions in the review)
>     https://issues.apache.org/jira/browse/MESOS-2657
> 
> Niklas Nielsen wrote:
>     Can you be a bit more concrete about your concerns and suggested path
>     forward? Are you suggesting we land 2035 first? Is it the qosKilled bool
>     that concerns you?
> 
> Jie Yu wrote:
>     My concern is that the 'killed' field in Termination is overloaded. If it
>     is true, it can mean any of: 1) the user initiated killTask; 2) the task
>     was killed because a container limit was reached (e.g., memory; this is
>     already buggy since we also have a disk limit now); or 3) the task was
>     killed by the QoS controller.
>     
>     Therefore, relying on this field to decide what 'reason' and 'status' to
>     send back is very unintuitive and hard to understand. I feel this is tech
>     debt we should fix as soon as possible.
>     
>     I don't have a suggestion yet. I need to think about it.
There are a few things we need to address:

1) What status should QoS-killed tasks have? I thought TASK_LOST was better
than TASK_FAILED: adding new terminal statuses requires support in the
frameworks, and my intuition was that we should try to exercise the
frameworks' existing retry logic with TASK_LOST.

2) How should a framework be able to tell the difference between the
different events? In this case, I thought it deserved its own reason (as
those are fine grained).

I didn't put more thought into this, since I mimicked the existing code, so I
am open to any change you guys have in mind. While I agree with all of the
above, it seems like a larger issue with the containerizer API (and not well
suited to be addressed here alone?).

> Therefore, relying on this field to decide what 'reason' and 'status' to
> send back is very unintuitive and hard to understand. Just feel that this
> is a tech debt we should fix asap.

This patch issues a containerizer->destroy() and that is where it ends; in
order to distinguish the cause of the kill, it looks like we need to change
the containerizer API?

- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
-----------------------------------------------------------


On June 4, 2015, 10:43 a.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34720/
> -----------------------------------------------------------
> 
> (Updated June 4, 2015, 10:43 a.m.)
> 
> 
> Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and
> Vinod Kone.
> 
> 
> Bugs: MESOS-2653
>     https://issues.apache.org/jira/browse/MESOS-2653
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3
>   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea
> 
> Diff: https://reviews.apache.org/r/34720/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 