> On June 12, 2015, 6:43 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 4383-4400
> > <https://reviews.apache.org/r/34720/diff/4/?file=983235#file983235line4383>
> >
> >     First of all, we don't need to check if 'executor' is NULL or not here. 
> > If you looked at the callers, it's guaranteed that 'executor' will be 
> > non-NULL. So you need a CHECK here.
> >     
> >     Also, I thought about what the correct logic should be in this funtion 
> > (see below). Let me know your thoughts. (we have too many tech debts in 
> > this function, so we better fix this asap, instead of waiting further).
> >     
> >     ```
> >     TaskState state;
> >     TaskStatus::Reason reason;
> >     string message;
> >     
> >     // Determine the state and reason for the status update.
> >     if (termination.isReady() && termination.get().has_reason()) {
> >       // This is the case where the resource limit has been
> >       // reached and the container destroyed itself.
> >       state = TASK_FAILED;
> >       reason = termination.get().reason();
> >     } else if (executor->reason.isSome()) {
> >       // This is the case where the slave initiates the
> >       // termination of the container.
> >       
> >       // NOTE: For command executors, previously we send TASK_FAILED
> >       // and REASON_COMMAND_EXECUTOR_FAILED when the command executor
> >       // failed to launch (e.g., invalid executor path).
> >       if (executor->isCommandExecutor() &&
> >           executor->reason.get() == REASON_EXECUTOR_REGISTRATION_TIMEOUT) {
> >         state = TASK_FAILED;
> >         reason = REASON_COMMAND_EXECUTOR_FAILED;
> >       } else {
> >         state = TASK_LOST;
> >         reason = executor->reason.get();
> >       }
> >     } else {
> >       state = TASK_LOST;
> >       reason = TaskStatus::REASON_EXECUTOR_TERMINATED
> >     }
> >     
> >     // Determine the message for the status update.
> >     // TODO(nnielsen): Consider adding a message from the slave
> >     // as well, and concatenating the messages.
> >     if (termination.isReady()) {
> >       message = termination.get().message();
> >     } else {
> >       // This is case where the containerizer fails
> >       // to destroy the container.
> >       message = "Abnormal executor termination"; 
> >     }
> >     
> >     statusUpdate(...);
> >     ```
> 
> Jie Yu wrote:
>     Forgot to mention that you need to set executor->reason accordingly when 
> slave initiates a destroy of the container (e.g., containerizer->update 
> fails, executor registration timeout, etc.).
> 
> Vinod Kone wrote:
>     Changing the command task state from TASK_FAILED to TASK_LOST is an API 
> change! It should've a deprecation cycle with a headsup given on the dev 
> list. I would recommend doing:
>     
>     ```
>     // Determine the state and reason for the status update.
>     if (termination.isReady() && termination.get().has_reason()) {
>       // This is the case where the resource limit has been
>       // reached and the container destroyed itself.
>       state = TASK_FAILED;
>       reason = termination.get().reason();
>     } else  if (executor->isCommandExecutor()) {
>         // TODO(jieyu): .....
>         state = TASK_FAILED;
>         reason = REASON_COMMAND_EXECUTOR_FAILED;
>     } else if (executor->reason.isSome()) {
>       // This is the case where the slave initiates the
>       // termination of the container.
>         state = TASK_LOST;
>         reason = executor->reason.get();
>     } else {
>       state = TASK_LOST;
>       reason = TaskStatus::REASON_EXECUTOR_TERMINATED
>     }
>     
>     ```

OK, that's why exactly the reason why I want this to be fixed this asap 
(otherwise, we'll have to deal with another API breaking change).

Vinod, your code does not work for niq. For example, if a command executor is 
killed due to QOS violation, we will send TASK_FAILED and 
REASON_COMMAND_EXECUTOR_FAILED. That's not correct.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review87734
-----------------------------------------------------------


On June 11, 2015, 10:58 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34720/
> -----------------------------------------------------------
> 
> (Updated June 11, 2015, 10:58 p.m.)
> 
> 
> Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-2653
>     https://issues.apache.org/jira/browse/MESOS-2653
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 0df1b55791963fb4159b7ea5318d09dde4f7d8c7 
>   src/slave/slave.cpp b523c2fce50e56f4f94d55a9488f49c53452e4d4 
> 
> Diff: https://reviews.apache.org/r/34720/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>

Reply via email to