> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing it that way, can we just make sure 
> > `containerizer->wait` here returns a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails? That way, 
> > `executorTerminated` will properly send status updates to the scheduler 
> > (TASK_LOST/TASK_FAILED).
> >     
> >     Or am I missing something?
> 
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler when containerizer launch fails until the executor 
> registration timeout fires. Since someone might use a very large timeout 
> value for the docker containerizer, the slave should ideally send a status 
> update to the scheduler right after containerizer launch fails.
>     
>     After chatting with Jay, the problem you guys are referring to is the 
> fact that the scheduler cannot distinguish between the case where the task 
> has failed and the case where the configuration of a task is not correct, 
> because in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) call in executorLaunched when containerizer->launch 
> fails. That way, it will trigger containerizer->wait and thus send a status 
> update to the scheduler.
> 
> Jie Yu wrote:
>     Regarding the second problem, IMO we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
>     The reason field sounds good; I think what you proposed makes sense. In 
> the docker containerizer, at least, we also need to make sure the termination 
> message is set correctly, as it currently doesn't contain all the error 
> information that we pass back to the launch future.
> 
> Jay Buffington wrote:
>     Sorry for the confusion. There are three problems that are all related. 
> Yes, we need to send status updates as soon as containerizer launch fails, 
> and yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request. However, my intention with this patch is not to 
> address either of those two problems.
>     
>     My goal with this patch is simply to send the containerizer launch 
> failure message back to the scheduler. I am using Aurora with the docker 
> containerizer. There are a myriad of reasons that dockerd could fail to 
> "docker run" a container. Currently, when that fails, the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web UI. 
> With this patch they see the stderr output of "docker run." This way they 
> can report meaningful error messages to the operations team.
>     
>     I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure().  Before this patch we were only 
> logging it.  The scheduler needs to get that error message.
> 
> Jie Yu wrote:
>     Thanks for clarifying it, Jay! In fact, my proposal above should be able 
> to solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
> properly set the message and reason fields in the Termination protobuf 
> (basically why the container gets terminated and what's the error message). 
> The slave will forward the reason and message to the scheduler through status 
> update.
>     
>     I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
>     
>     1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in status update 
> message. For instance, for a containerizer launch failure, we probably need 
> two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the 
> second level reason REASON_DOCKER_PULL_FAILURE;
>     
>     2. We probably want to allow extension to TaskStatus::Reason enum. For 
> example, some framework/executor may want to add customized reasons. We could 
> leverage the protobuf extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
>     
>     3. The semantics around Termination are currently broken and we need to 
> clean them up. For instance, the following code is broken,
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady() && termination.get().killed()) {
>         taskState = TASK_FAILED;
>         // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
>         reason = TaskStatus::REASON_MEMORY_LIMIT;
>       }
>     }
>     ```
>     because we now have disk limit as well.
>     
>     Another issue with Termination is that containerizer->wait sometimes 
> returns a Failure (e.g., launch failed, or destroy failed), which means we 
> cannot get the reason and message fields from the Termination anymore. Right 
> now, if that happens, we always set the reason to 
> REASON_EXECUTOR_TERMINATED and the message to "Abnormal executor 
> termination" in the status update. I think this is a hack; IMO, the 
> containerizer should always return a valid Termination protobuf to the slave 
> so that the slave can send a meaningful status update to the framework.
> 
> Jay Buffington wrote:
>     Thanks for these comments, Jie.  I'll work on implementing your proposal.
>     
>     I was unaware of the existence of the Termination protobuf. I am 
> confused by these comments (which I think are just flat out wrong) at 
> https://github.com/apache/mesos/blob/c36d5996327ca765f49c211d489371c99ef8e090/src/slave/slave.cpp#L3177:
>     
>        // A termination failure indicates the containerizer could not
>        // destroy a container.
>        // TODO(idownes): This is a serious error so consider aborting the
>        // slave if this occurs.
>        
>     I don't understand when you would ever return Failure() in your 
> containerizer without doing container->termination.set(termination);
>     
>     Thanks!
> 
> Jie Yu wrote:
>     > I don't understand when you would ever return Failure() in your 
> containerizer without doing container->termination.set(termination);
>     
>     Yeah, I think that's tech debt and we should correct it. Let me know if 
> you need any help with this.

I finally sat down to work through what you said, Jie.

Check this out:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1194-L1202

When pull() fails, destroy() is called.  Destroy correctly sets the 
Termination, but then it deletes the container!  Then, when we call 
containerizer->wait() in slave.cpp, we get back a Failure("Unknown 
container: ...") instead of the Termination future like we should.
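To make the failure mode concrete, here is a minimal, self-contained sketch. Everything below is a simplified stand-in for the real docker containerizer bookkeeping (the real Termination is a protobuf and wait() returns a libprocess Future), not actual Mesos APIs: once destroy() erases the container, a later wait() can only report "Unknown container".

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-ins for illustration only.
struct Termination {
  std::string message;
  bool killed;
};

struct Result {
  bool failed;              // true models a Failure(...) from wait()
  std::string error;        // set when failed
  Termination termination;  // set when !failed
};

struct Containerizer {
  std::map<std::string, Termination> containers;

  // Models the current behavior: destroy() sets the Termination, but then
  // erases the container, so the recorded Termination is lost.
  void destroy(const std::string& id, const std::string& why) {
    containers[id] = Termination{why, true};
    containers.erase(id);  // the Termination disappears here
  }

  // A later wait() cannot find the container and returns a failure, which
  // is what the slave turns into "Abnormal executor termination".
  Result wait(const std::string& id) {
    auto it = containers.find(id);
    if (it == containers.end()) {
      return Result{true, "Unknown container: " + id, Termination{}};
    }
    return Result{false, "", it->second};
  }
};
```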

IMHO the solution here is to split up the code from destroy() into two 
methods: one that sets the termination (call this launchFailed()), and then 
make destroy() just erase the container.

Then we should only call launchFailed() from launch (here: 
https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L636)
 

When launch fails we should call containerizer->destroy() from slave's 
executorLaunched method.
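A sketch of that split, using simplified stand-in types rather than the real Mesos classes (launchFailed() is the hypothetical method name proposed here, not an existing API): the launch path records the Termination, wait() can still read it, and destroy() only erases the bookkeeping afterwards.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for the Termination protobuf.
struct Termination {
  std::string message;
  bool killed;
};

struct Containerizer {
  std::map<std::string, Termination> terminations;

  // Called from the containerizer's launch path when launch fails:
  // records why, but keeps the entry so wait() can still find it.
  void launchFailed(const std::string& id, const std::string& why) {
    terminations[id] = Termination{why, true};
  }

  // destroy() now only erases bookkeeping; the slave calls it from
  // executorLaunched after wait() has had a chance to see the Termination.
  void destroy(const std::string& id) {
    terminations.erase(id);
  }

  // wait() returns the recorded Termination instead of an
  // "Unknown container" failure.
  bool wait(const std::string& id, Termination* out) {
    auto it = terminations.find(id);
    if (it == terminations.end()) {
      return false;
    }
    *out = it->second;
    return true;
  }
};
```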


- Jay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On April 21, 2015, 10:14 a.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated April 21, 2015, 10:14 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> 
> 
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> 
> Introduces a new TaskStatus: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I added a test case to slave_tests.cpp.  I also tried this with Aurora: I 
> supplied a bogus docker image URL and saw the "docker pull" failure stderr 
> message in Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>
