> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing it that way, can we just try to make sure 
> > `containerizer->wait` here returns a failure (or a Termination with some 
> > reason) when `containerizer->launch` fails? That way, `executorTerminated` 
> > will properly send status updates to the scheduler.
> >     
> >     Or am I missing something?
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler when the containerizer launch fails until the executor 
> re-registration timeout fires. Since someone might use a very large timeout 
> value for the Docker containerizer, the slave should ideally send a status 
> update to the scheduler right after the containerizer launch fails.
>     After chatting with Jay: the problem you guys are referring to is the fact 
> that the scheduler cannot distinguish the case where the task has failed from 
> the case where the task's configuration is incorrect, because in both cases 
> the scheduler receives a TASK_FAILED/TASK_LOST.
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a 
> `containerizer->destroy(...)` call in `executorLaunched` when 
> `containerizer->launch` fails. That will trigger `containerizer->wait` and 
> thus send a status update to the scheduler.
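>     A rough sketch of that flow (signatures and logging are simplified 
> approximations, not the actual slave.cpp code):
>     ```
>     // Continuation of containerizer->launch() in the slave.
>     void Slave::executorLaunched(
>         const FrameworkID& frameworkId,
>         const ExecutorID& executorId,
>         const ContainerID& containerId,
>         const Future<bool>& future)
>     {
>       if (!future.isReady()) {
>         LOG(ERROR) << "Failed to launch container " << containerId
>                    << " for executor '" << executorId << "': "
>                    << (future.isFailed() ? future.failure() : "discarded");
>     
>         // Destroying the container completes containerizer->wait(), which
>         // drives executorTerminated() and hence the status update path.
>         containerizer->destroy(containerId);
>         return;
>       }
>     
>       // ... existing handling of a successful launch ...
>     }
>     ```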
> Jie Yu wrote:
>     Regarding the second problem, IMO we should include a reason field in 
> `Termination` (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> `sendExecutorTerminatedStatusUpdate` propagate the termination reason to the 
> scheduler.
> Timothy Chen wrote:
>     A reason field sounds good; I think what you proposed makes sense. In the 
> Docker containerizer, at least, we also need to make sure the termination 
> message is set correctly, since it currently doesn't contain all of the error 
> information that we pass back to the launch future.
> Jay Buffington wrote:
>     Sorry for the confusion. There are three related problems here. Yes, we 
> need to send status updates as soon as the containerizer launch fails, and 
> yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request. However, my intention with this patch is not to 
> address either of those two problems.
>     My goal with this patch is simply to send the containerizer launch 
> failure message back to the scheduler. I am using Aurora with the Docker 
> containerizer. There are a myriad of reasons why dockerd could fail to 
> "docker run" a container. Currently, when that happens, the user sees a 
> useless and incorrect "Abnormal Executor Termination" message in the Aurora 
> web UI. With this patch they see the stderr output of "docker run", so they 
> can report meaningful error messages to the operations team.
>     I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails, we have to send a status update with 
> a message that contains `future.failure()`. Before this patch we were only 
> logging it. The scheduler needs to get that error message.
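>     For illustration, the heart of that change looks roughly like this 
> (protobuf field names as generated from the Mesos .proto files; the timestamp, 
> UUID, forwarding, and iteration over the executor's tasks are omitted):
>     ```
>     // On containerizer launch failure, build a TASK_FAILED status whose
>     // message carries future.failure() instead of only logging it.
>     StatusUpdate update;
>     update.mutable_framework_id()->CopyFrom(frameworkId);
>     update.mutable_slave_id()->CopyFrom(info.id());
>     
>     TaskStatus* status = update.mutable_status();
>     status->mutable_task_id()->CopyFrom(taskId);
>     status->set_state(TASK_FAILED);
>     status->set_source(TaskStatus::SOURCE_SLAVE);
>     status->set_message("Container launch failed: " + future.failure());
>     ```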
> Jie Yu wrote:
>     Thanks for clarifying, Jay! In fact, my proposal above should be able to 
> solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should properly 
> set the message and reason fields in the Termination protobuf (basically, why 
> the container was terminated and what the error message is), and the slave 
> will forward the reason and message to the scheduler through a status update.
>     I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
>     1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in the status update 
> message. For instance, for a containerizer launch failure, we probably need 
> two reasons: 1) the top-level reason REASON_EXECUTOR_TERMINATED; 2) the 
> second-level reason REASON_DOCKER_PULL_FAILURE.
>     2. We probably want to allow extensions to the TaskStatus::Reason enum. 
> For example, some frameworks/executors may want to add customized reasons. We 
> could leverage protobuf's extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
>     3. The semantics around Termination are currently broken and we need to 
> clean them up. For instance, the following code is broken,
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady() && termination.get().killed()) {
>         taskState = TASK_FAILED;
>         // TODO(dhamon): MESOS-2035: Add 'reason' to 
> containerizer::Termination.
>         reason = TaskStatus::REASON_MEMORY_LIMIT;
>       }
>     }
>     ```
>     because we now have a disk limit as well.
>     Another issue with Termination is that `containerizer->wait` sometimes 
> returns a Failure (e.g., the launch failed or the destroy failed), which means 
> we cannot get the reason and message fields from Termination anymore. Right 
> now, if that happens, we always set the reason to REASON_EXECUTOR_TERMINATED 
> and the message to "Abnormal executor termination" in the status update. I 
> think this is a hack and, IMO, the containerizer should always return a valid 
> Termination protobuf to the slave, so that the slave can send a meaningful 
> status update to the framework.
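>     To make that last point concrete, once Termination carries a reason (the 
> hypothetical field from MESOS-2035), the forwarding could look roughly like 
> this (simplified sketch, not the current code):
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       TaskState taskState = TASK_LOST;
>       TaskStatus::Reason reason = TaskStatus::REASON_EXECUTOR_TERMINATED;
>       string message = "Abnormal executor termination";
>     
>       if (termination.isReady()) {
>         if (termination.get().killed()) {
>           taskState = TASK_FAILED;
>         }
>         if (termination.get().has_message()) {
>           message = termination.get().message();
>         }
>         // Hypothetical field (MESOS-2035): the containerizer states *why*
>         // (e.g. memory vs. disk limit) instead of the slave guessing.
>         if (termination.get().has_reason()) {
>           reason = termination.get().reason();
>         }
>       }
>     
>       // Build the status update from taskState, reason, and message, and
>       // forward it to the scheduler.
>       ...
>     }
>     ```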

Thanks for these comments, Jie.  I'll work on implementing your proposal.

I was unaware of the existence of the Termination protobuf.  I am confused by 
these comments (which I think are just flat-out wrong), which say:

```
// A termination failure indicates the containerizer could not destroy a
// container.
// TODO(idownes): This is a serious error so consider aborting the slave if
// this occurs.
```
I don't understand when you would ever return `Failure()` from your 
containerizer without also doing `container->termination.set(termination)`.
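Roughly, the pattern in question (hypothetical containerizer internals with 
illustrative names; the point is completing the termination promise with a 
populated Termination rather than failing the wait future):

```
// Hypothetical cleanup path inside a containerizer: even on failure, fill in
// the Termination message (e.g. the stderr of "docker run") and complete the
// promise, so containerizer->wait() gives the slave something to forward.
void SomeContainerizerProcess::_destroy(
    const ContainerID& containerId,
    const string& message)
{
  containerizer::Termination termination;
  termination.set_message(message);

  // Setting the promise (rather than failing it) lets the slave's
  // executorTerminated path build a meaningful status update.
  containers_[containerId]->termination.set(termination);
  containers_.erase(containerId);
}
```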


- Jay

On April 21, 2015, 10:14 a.m., Jay Buffington wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> (Updated April 21, 2015, 10:14 a.m.)
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> Repository: mesos
> Description
> -------
> When Mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> Fixes MESOS-2020
> Diffs
> -----
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> Diff: https://reviews.apache.org/r/33249/diff/
> Testing
> -------
> I added a test case to slave_tests.cpp.  I also tried this with Aurora: I 
> supplied a bogus Docker image URL and saw the "docker pull" failure's stderr 
> message in Aurora's web UI.
> Thanks,
> Jay Buffington
