> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing it that way, can we just try to make sure 
> > `containerizer->wait` here returns a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails? In that way, 
> > `executorTerminated` will properly send status updates to the 
> > scheduler.
> >     
> >     Or am I missing something?
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler when the containerizer launch fails until the executor 
> reregistration timeout fires. Since someone might use a very large timeout 
> value for the docker containerizer, ideally the slave should send a status 
> update to the scheduler right after the containerizer launch fails.
>     After chatting with Jay, the problem you guys are referring to is the fact 
> that the scheduler cannot distinguish between the case where the task has 
> failed and the case where the configuration of a task is not correct, because 
> in both cases the scheduler receives a TASK_FAILED/TASK_LOST.
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> Jie Yu wrote:
>     Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate to propagate the termination reason to the 
> scheduler.
> Timothy Chen wrote:
>     A reason field sounds good; I think what you proposed makes sense. In the 
> docker containerizer, at least, we also need to make sure the termination 
> message is set correctly, since it currently doesn't contain all the error 
> information that we pass back to the launch future.
> Jay Buffington wrote:
>     Sorry for the confusion. There are three problems here, all related. 
> Yes, we need to send status updates as soon as the containerizer launch fails, 
> and yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request. However, my intention with this patch is not to 
> address either of those two problems.
>     My goal with this patch is simply to send the containerizer launch 
> failure message back to the scheduler. I am using Aurora with the docker 
> containerizer. There are a myriad of reasons that dockerd could fail to 
> "docker run" a container. Currently, when that fails, the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web UI. 
> With this patch they see the stderr output of "docker run," so they can 
> report meaningful error messages to the operations team.
>     I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a status update with 
> a message that contains future.failure(). Before this patch we were only 
> logging it. The scheduler needs to get that error message.

Thanks for clarifying, Jay! In fact, my proposal above should be able to 
solve the third problem cleanly. Check out 
`Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should properly 
set the message and reason fields in the Termination protobuf (basically, why 
the container was terminated and what the error message is). The slave will 
then forward the reason and message to the scheduler through a status update.
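
To make that concrete, here is a rough, self-contained sketch of the intended 
flow. It uses stand-in types rather than the real Mesos classes, and the 
`reason` field on Termination is the proposed MESOS-2035 addition, not 
existing API:

#include <iostream>
#include <string>

// Stand-ins for the real protobufs; only the fields relevant here.
struct Termination
{
  std::string reason;   // Proposed by MESOS-2035; not in mesos.proto yet.
  std::string message;  // E.g., the stderr captured from 'docker run'.
};

struct TaskStatus
{
  std::string state;
  std::string reason;
  std::string message;
};

// Models Slave::sendExecutorTerminatedStatusUpdate: copy the reason and
// message set by the containerizer straight into the status update.
TaskStatus sendExecutorTerminatedStatusUpdate(const Termination& termination)
{
  return TaskStatus{"TASK_FAILED", termination.reason, termination.message};
}

int main()
{
  Termination termination{
      "REASON_CONTAINER_LAUNCH_FAILED",  // Hypothetical reason value.
      "docker pull failed: image not found"};

  TaskStatus status = sendExecutorTerminatedStatusUpdate(termination);

  // This is the message the scheduler (and, e.g., the Aurora UI) would show.
  std::cout << status.state << ": " << status.message << std::endl;
  return 0;
}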

I chatted with BenM about this yesterday, and there are a few notes I want to 
write down here.

1. We probably need multiple levels for TaskStatus::Reason. In other words, we 
probably need a "repeated Reason reasons" field in the status update message. 
For instance, for a containerizer launch failure, we probably need two reasons: 
1) the top-level reason, REASON_EXECUTOR_TERMINATED; 2) a second-level reason 
describing why the executor terminated (e.g., the containerizer launch failure).
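
A rough sketch of what a two-level reason could look like, with stand-in 
types (the "repeated Reason reasons" field and the second-level enum value are 
only proposals here, not real Mesos API):

#include <iostream>
#include <string>
#include <vector>

// Stand-in for TaskStatus::Reason; only the values discussed above.
enum class Reason
{
  REASON_EXECUTOR_TERMINATED,      // Top level: the task's executor is gone.
  REASON_CONTAINER_LAUNCH_FAILED   // Hypothetical second level: why it is gone.
};

struct TaskStatus
{
  std::vector<Reason> reasons;  // Would map to 'repeated Reason reasons'.
  std::string message;
};

int main()
{
  TaskStatus status;
  status.reasons.push_back(Reason::REASON_EXECUTOR_TERMINATED);
  status.reasons.push_back(Reason::REASON_CONTAINER_LAUNCH_FAILED);
  status.message = "docker run failed: no such image";

  // A scheduler could walk the reasons from most to least general.
  std::cout << status.reasons.size() << " reasons, message: "
            << status.message << std::endl;
  return 0;
}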

2. We probably want to allow extensions to the TaskStatus::Reason enum. For 
example, some framework/executor may want to add customized reasons. We could 
leverage protobuf's extension support to achieve that.
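
One caveat worth noting: protobuf extensions attach fields to messages rather 
than new values to enums, so customized reasons would presumably travel as an 
extension field on TaskStatus. A stand-in sketch of that shape (none of these 
names are real Mesos or protobuf API):

#include <iostream>
#include <map>
#include <string>

// Stand-in for TaskStatus with an extension-like side channel; real protobuf
// extensions would reserve a field-number range in mesos.proto instead.
struct TaskStatus
{
  std::string reason;                     // Built-in TaskStatus::Reason.
  std::map<int, std::string> extensions;  // Models protobuf extension fields.
};

int main()
{
  const int kCustomReasonTag = 1001;  // Hypothetical extension field number.

  TaskStatus status;
  status.reason = "REASON_EXECUTOR_TERMINATED";
  status.extensions[kCustomReasonTag] = "AURORA_SANDBOX_SETUP_FAILED";

  // A framework that knows the tag can recover its customized reason.
  std::cout << status.extensions[kCustomReasonTag] << std::endl;
  return 0;
}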

3. The semantics around Termination are currently broken and we need to clean 
them up. For instance, the following code is broken:

void Slave::sendExecutorTerminatedStatusUpdate(...)
{
  ...
  if (termination.isReady() && termination.get().killed()) {
    taskState = TASK_FAILED;
    // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
    reason = TaskStatus::REASON_MEMORY_LIMIT;
  }
  ...
}

This is broken because we now have a disk limit as well, so killed() no longer 
implies a memory-limit kill.
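
A minimal sketch of how an explicit reason on Termination would remove the 
guesswork (stand-in types again; DISK_LIMIT and friends are illustrative, not 
real enum values):

#include <iostream>

// Stand-ins: the proposed 'reason' makes the cause of a kill explicit
// instead of being inferred from killed() alone.
enum class Reason { MEMORY_LIMIT, DISK_LIMIT, UNKNOWN };

struct Termination
{
  bool killed = false;
  Reason reason = Reason::UNKNOWN;  // Proposed by MESOS-2035.
};

Reason reasonForUpdate(const Termination& termination)
{
  // Old logic: killed() unconditionally implied REASON_MEMORY_LIMIT.
  // With an explicit reason, a disk-limit kill is reported correctly.
  return termination.killed ? termination.reason : Reason::UNKNOWN;
}

int main()
{
  Termination termination;
  termination.killed = true;
  termination.reason = Reason::DISK_LIMIT;

  std::cout << (reasonForUpdate(termination) == Reason::DISK_LIMIT
                    ? "disk limit kill reported correctly"
                    : "misreported")
            << std::endl;
  return 0;
}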

Another issue with Termination is that containerizer->wait sometimes returns a 
Failure (e.g., launch failed, or destroy failed), which means we cannot get 
the reason and message fields from the Termination anymore. Right now, if that 
happens, we always set the reason to REASON_EXECUTOR_TERMINATED and the 
message to "Abnormal executor termination" in the status update. I think 
this is a hack; IMO, the containerizer should always return a valid 
Termination protobuf to the slave so that the slave can send a meaningful 
status update to the framework.
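
For illustration, a stand-in sketch of the current fallback versus the 
proposed behavior (std::optional plays the role of a Future that may have 
failed; none of this is the actual Mesos code):

#include <iostream>
#include <optional>
#include <string>
#include <utility>

struct Termination
{
  std::string reason;
  std::string message;
};

// Current behavior: if wait() failed we fabricate a generic update. The
// proposal is that the containerizer always resolves wait() with a valid
// Termination so this fallback branch disappears.
std::pair<std::string, std::string> statusFields(
    const std::optional<Termination>& termination)
{
  if (!termination.has_value()) {
    return {"REASON_EXECUTOR_TERMINATED", "Abnormal executor termination"};
  }
  return {termination->reason, termination->message};
}

int main()
{
  // Simulate containerizer->wait returning a Failure (e.g., launch failed).
  auto fields = statusFields(std::nullopt);
  std::cout << fields.second << std::endl;  // The unhelpful message users see.
  return 0;
}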

- Jie

On April 21, 2015, 5:14 p.m., Jay Buffington wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> (Updated April 21, 2015, 5:14 p.m.)
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> Repository: mesos
> Description
> -------
> When Mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> Fixes MESOS-2020
> Diffs
> -----
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> Diff: https://reviews.apache.org/r/33249/diff/
> Testing
> -------
> I added a test case to slave_tests.cpp. I also tried this with Aurora: I 
> supplied a bogus docker image URL and saw the "docker pull" failure stderr 
> message in Aurora's web UI.
> Thanks,
> Jay Buffington
