> On June 4, 2015, 10:58 a.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 4400-4413
> > <https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400>
> >
> >     This is getting a little hairy. As the TODO says, we really ought to
> >     bubble this up via the Termination protobuf. Have you looked into it?
> 
> Jie Yu wrote:
>     +1
>     
>     It really gets nasty here. I remember that jaybuff said they plan to fix
>     that. Maybe you want to sync with them first.
> 
> Niklas Nielsen wrote:
>     Mind adding a few references? Where is the Termination proto?
> 
> Jie Yu wrote:
>     Some references for context:
>     https://issues.apache.org/jira/browse/MESOS-2035
>     https://reviews.apache.org/r/33249/ (see discussions in the review)
>     https://issues.apache.org/jira/browse/MESOS-2657
> 
> Niklas Nielsen wrote:
>     Can you be a bit more concrete about your concerns and suggested path
>     forward? Are you suggesting we land 2035 first? Is it the qosKilled bool
>     that concerns you?
> 
> Jie Yu wrote:
>     My concern is that the 'killed' field in Termination is overloaded. If it
>     is true, it can mean any of: 1) the user initiated killTask; 2) the task
>     was killed because a container limit was reached (e.g., memory; this is
>     already buggy since we also have a disk limit now); or 3) the task was
>     killed by the QoS controller.
>     
>     Therefore, relying on this field to decide what 'reason' and 'status' to
>     send back is very unintuitive and hard to understand. I feel this is tech
>     debt we should fix as soon as possible.
>     
>     I don't have a suggestion yet. I need to think about it.
There are a few things we need to address:

1) What status should QoS-killed tasks have? I thought TASK_LOST was better
than TASK_FAILED: adding new terminal statuses requires support in the
frameworks, and my intuition was that we should try to exercise the
frameworks' existing retry logic with TASK_LOST.

2) How should a framework be able to tell the difference between the
different events? In this case, I thought it deserved its own reason (as
those are fine grained).

I didn't put more thought into this, since I mimicked the existing code, so I
am open to any change you guys have in mind. While I agree with all of the
above, it seems like a larger issue with the containerizer API (and not well
suited to be addressed here alone?).

> Therefore, relying on this field to decide what 'reason' and 'status' to
> send back is very unintuitive and hard to understand. Just feel that this
> is a tech debt we should fix asap.

This patch issues a containerizer->destroy() and that is where it ends; in
order to distinguish the cause of the kill, it looks like we need to change
the containerizer API?

- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
-----------------------------------------------------------


On June 4, 2015, 10:43 a.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34720/
> -----------------------------------------------------------
> 
> (Updated June 4, 2015, 10:43 a.m.)
> 
> 
> Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and
> Vinod Kone.
> 
> 
> Bugs: MESOS-2653
>     https://issues.apache.org/jira/browse/MESOS-2653
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3
>   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea
> 
> Diff: https://reviews.apache.org/r/34720/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 