> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing it that way, can we just try to make sure 
> > `containerizer->wait` here returns a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails? In that way, 
> > `executorTerminated` will properly send status updates to the scheduler 
> > (TASK_LOST/TASK_FAILED).
> >     
> >     Or am I missing something?
> 
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler when containerizer launch fails until the executor 
> reregistration timeout fires. Since someone might use a very large timeout 
> value for the docker containerizer, the slave should ideally send a status 
> update to the scheduler right after the containerizer launch fails.
>     
>     After chatting with Jay, the problem you guys are referring to is the 
> fact that the scheduler cannot distinguish between the case where the task 
> has failed and the case where the configuration of a task is not correct, 
> because in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. That will trigger containerizer->wait and thus cause a status update 
> to be sent to the scheduler.
> 
> Jie Yu wrote:
>     Regarding the second problem, IMO we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
>     The reason field sounds good, and I think what you proposed makes sense. 
> In the docker containerizer, at least, we also need to make sure the 
> termination message is set correctly, since currently it doesn't contain all 
> the error information that we pass back to the launch future.
> 
> Jay Buffington wrote:
>     Sorry for the confusion.  There are three problems that are all related. 
> Yes, we need to send status updates as soon as the containerizer launch 
> fails, and, yes, we need to set the reason so the scheduler can distinguish a 
> slave failure from a bad request.  However, my intention with this patch is 
> not to address either of those two problems.
>     
>     My goal with this patch is simply to send the containerizer launch 
> failure message back to the scheduler.  I am using Aurora with the docker 
> containerizer.  There are a myriad of reasons that dockerd could fail to 
> "docker run" a container.  Currently, when that fails, the user sees a 
> useless and incorrect "Abnormal Executor Termination" message in the Aurora 
> web UI.  With this patch they see the stderr output of "docker run."  This 
> way they can report meaningful error messages to the operations team.
>     
>     I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure().  Before this patch we were only 
> logging it.  The scheduler needs to get that error message.
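>     
>     To make the shape of that concrete, here is a rough sketch (not the 
> exact patch; the `task` variable and the surrounding plumbing are assumed) 
> of a TASK_FAILED update whose message carries future.failure():
>     ```
>     // Sketch only: build a status update for the affected task when the
>     // containerizer launch future fails, carrying the real error text
>     // (e.g. the stderr of "docker run") back to the scheduler.
>     TaskStatus status;
>     status.mutable_task_id()->CopyFrom(task.task_id());
>     status.set_state(TASK_FAILED);
>     status.set_source(TaskStatus::SOURCE_SLAVE);
>     status.set_reason(TaskStatus::REASON_CONTAINERIZER_LAUNCH_FAILED);
>     status.set_message(
>         "Failed to launch container: " +
>         (future.isFailed() ? future.failure() : "future discarded"));
>     
>     // Hand the update to the slave's status update machinery so it is
>     // forwarded to the framework.
>     ...
>     ```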
> 
> Jie Yu wrote:
>     Thanks for clarifying it, Jay! In fact, my proposal above should be able 
> to solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
> properly set the message and reason fields in the Termination protobuf 
> (basically, why the container was terminated and what the error message is). 
> The slave will then forward the reason and message to the scheduler through a 
> status update.
>     
>     I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
>     
>     1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in the status 
> update message. For instance, for a containerizer launch failure, we probably 
> need two reasons: 1) the top-level reason REASON_EXECUTOR_TERMINATED; and 2) 
> the second-level reason REASON_DOCKER_PULL_FAILURE.
>     
>     2. We probably want to allow extension to TaskStatus::Reason enum. For 
> example, some framework/executor may want to add customized reasons. We could 
> leverage the protobuf extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
>     
>     3. The semantics around Termination are currently broken and we need to 
> clean them up. For instance, the following code is broken,
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady() && termination.get().killed()) {
>         taskState = TASK_FAILED;
>         // TODO(dhamon): MESOS-2035: Add 'reason' to
>         // containerizer::Termination.
>         reason = TaskStatus::REASON_MEMORY_LIMIT;
>       }
>     }
>     ```
>     because we now have a disk limit as well.
>     
>     Another issue with Termination is that containerizer->wait sometimes 
> returns a Failure (e.g., launch failed, or destroy failed), which means we 
> cannot get the reason and message fields from the Termination anymore. Right 
> now, if that happens, we always set the reason to 
> REASON_EXECUTOR_TERMINATED and the message to "Abnormal executor 
> termination" in the status update. I think this is a hack; IMO the 
> containerizer should always return a valid Termination protobuf to the slave 
> so that the slave can send a meaningful status update to the framework.
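>     
>     To make that concrete, here is a rough sketch of the shape I have in 
> mind for sendExecutorTerminatedStatusUpdate (assuming MESOS-2035 adds 
> optional reason and message fields to Termination; the field accessors and 
> local variable names here are assumptions):
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady()) {
>         // Trust the containerizer: take the reason and message directly
>         // from the Termination instead of hard-coding REASON_MEMORY_LIMIT.
>         if (termination.get().killed()) {
>           taskState = TASK_FAILED;
>         }
>         if (termination.get().has_reason()) {
>           reason = termination.get().reason();
>         }
>         if (termination.get().has_message()) {
>           message = termination.get().message();
>         }
>       } else {
>         // Should become unreachable once the containerizer always returns
>         // a valid Termination.
>         reason = TaskStatus::REASON_EXECUTOR_TERMINATED;
>         message = "Abnormal executor termination";
>       }
>       ...
>     }
>     ```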
> 
> Jay Buffington wrote:
>     Thanks for these comments, Jie.  I'll work on implementing your proposal.
>     
>     I was unaware of the existence of the Termination protobuf.  I am 
> confused by these comments (which I think are just flat-out wrong): 
> https://github.com/apache/mesos/blob/c36d5996327ca765f49c211d489371c99ef8e090/src/slave/slave.cpp#L3177
>  which say:
>     
>        // A termination failure indicates the containerizer could not destroy a
>        // container.
>        // TODO(idownes): This is a serious error so consider aborting the slave if
>        // this occurs.
>        
>     I don't understand when you would ever return Failure() in your 
> containerizer without doing container->termination.set(termination);
>     
>     Thanks!
> 
> Jie Yu wrote:
>     > I don't understand when you would ever return Failure() in your 
> containerizer without doing container->termination.set(termination);
>     
>     Yeah, I think that's tech debt and we should correct it. Let me know if 
> you need any help with this.
> 
> Jay Buffington wrote:
>     I finally sat down to work through what you said, Jie.
>     
>     Check this out:
>     
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1194-L1202
>     
>     When pull() fails, destroy() is called.  destroy() correctly sets the 
> Termination, but then it deletes the container!  So when we call 
> containerizer->wait() in slave.cpp, we get back a Failure("Unknown 
> container: ...") instead of the termination future like we should.
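>     
>     In other words, wait() ends up looking roughly like this (paraphrasing 
> docker.cpp from memory, so the details may be slightly off):
>     ```
>     Future<containerizer::Termination> DockerContainerizerProcess::wait(
>         const ContainerID& containerId)
>     {
>       if (!containers_.contains(containerId)) {
>         // Once destroy() has erased the container, the Termination it set
>         // is gone and the slave only ever sees this generic failure.
>         return Failure("Unknown container: " + stringify(containerId));
>       }
>     
>       return containers_[containerId]->termination.future();
>     }
>     ```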
>     
>     IMHO the solution here is to split up the code from destroy() into two 
> methods: one that sets the termination (call this launchFailed()) and then 
> make destroy() just erase the container.
>     
>     Then we should only call launchFailed() from launch (here: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L636)
>     
>     When launch fails, we should call containerizer->destroy() from the 
> slave's executorLaunched method.
> 
> Jie Yu wrote:
>     An alternative: can we simply move 
> `containerizer->wait(containerId)` from `Slave::executorLaunched` to 
> `Framework::launchExecutor`, right after `launch.onAny(defer(slave, 
> &Slave::executorLaunched, ...));`?
> 
> Jay Buffington wrote:
>     I spoke to Jie about this.  It turns out that launch and wait need to be 
> serialized to support the composing containerizer.  See 
> https://reviews.apache.org/r/23463   
>     
>     Jie questioned whether destroy() really needs to be called on docker 
> containerizer launch failure.  To answer this, I looked at what we do in 
> destroy() if the container state is <= RUNNING.
>     
>     If the state is FETCHING, destroy() will do fetcher->kill(containerId).  
> It isn't necessary to kill the fetcher if it has already failed.
>     
>     Similarly, if the state is PULLING, destroy() does 
> container->pull.discard(), which isn't necessary if the pull failed.
>     
>     If the state is RUNNING, destroy() will try to kill the executor if it 
> has started.  If "docker run" fails, we know the executor isn't running.  
> destroy() then goes on to attempt "docker stop" on a container that was 
> never created.  This results in log spam about stopping a container that 
> doesn't exist.
>     
>     Instead of calling destroy() on launch failure, we should call a method 
> that sets the Termination on the container with the proper message and 
> reason.  We'll still need to erase the container object from the hashmap 
> somewhere and delete it.
>     
>     Thoughts?
> 
> Timothy Chen wrote:
>     It isn't necessary to kill the fetcher if it failed, but if destroy is 
> called in the middle of a fetch we do want to make sure the fetch is killed; 
> otherwise it leaves a long-running fetch process around for no reason. The 
> same applies to pull.
>     RUNNING is a more interesting case: we don't know if the container is 
> really still running, we just know it got past fetch and pull, so at least 
> we tried to launch it, and stop shouldn't cause any problems.
>     
>     I think it's fine if we don't call destroy on failure, but we do have to 
> make sure to set the Termination as you mentioned, and also update the state 
> of the container.
> 
> Jie Yu wrote:
>     My thought is: we don't even need a method that sets the Termination when 
> launch fails. Can we remove the `onFailed` callback in `launch` and simply 
> call `containerizer->destroy` in `executorLaunched` (right after 
> `containerizer->wait` is called) if the containerizer launch fails? The 
> destroy will properly set the Termination. Also, that way it's guaranteed 
> that `containerizer->wait` will be called before `containerizer->destroy`, so 
> we can always get back a valid Termination.
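>     
>     Roughly something like this in `Slave::executorLaunched` (just a sketch 
> of the ordering; the exact signatures and logging are from memory and may 
> differ):
>     ```
>     void Slave::executorLaunched(
>         const FrameworkID& frameworkId,
>         const ExecutorID& executorId,
>         const ContainerID& containerId,
>         const Future<bool>& future)
>     {
>       ...
>       // Register interest in the termination first ...
>       containerizer->wait(containerId)
>         .onAny(defer(self(),
>                      &Self::executorTerminated,
>                      frameworkId,
>                      executorId,
>                      lambda::_1));
>     
>       // ... and only then destroy on a failed launch, so that `wait` is
>       // guaranteed to observe the Termination that destroy sets.
>       if (!future.isReady()) {
>         LOG(ERROR) << "Container launch failed: "
>                    << (future.isFailed() ? future.failure() : "discarded");
>     
>         containerizer->destroy(containerId);
>       }
>     }
>     ```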
>     
>     This solution also helps solve 
> https://issues.apache.org/jira/browse/MESOS-2656. 
>     
>     What do you guys think?

Yes, this makes sense.  We need to guarantee that containerizer->wait and 
containerizer->destroy are serialized.

This race condition also exists in the mesos containerizer because it calls 
destroy on launch failure as well.  See 
https://github.com/apache/mesos/blob/664788b94/src/slave/containerizer/mesos/containerizer.cpp#L568-L587

Note that we'll need to update this comment as it will no longer be true:
https://github.com/apache/mesos/blob/664788b94/src/slave/containerizer/containerizer.hpp#L124-L125


- Jay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On May 4, 2015, 5:43 p.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated May 4, 2015, 5:43 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> 
> Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I added a test case to slave_tests.cpp.  I also tried this with Aurora, 
> supplied a bogus docker image URL, and saw the "docker pull" failure stderr 
> message in Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>
