Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

Timothy Chen Mon, 04 May 2015 23:34:06 -0700


> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> >     Instead of doing that in your way, can we just try to make sure 
> > `containerizer->wait` here will return a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails. In that way, the 
> > `executorTerminated` will properly send status updates to the slave 
> > (TASK_LOST/TASK_FAILED).
> >     
> >     Or am I missing something?
> 
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am refering to is the fact that we don't send status 
> update to the scheduler if containerizer launch fails until executor 
> reregistration timeout happens. Since for docker containerizer, someone might 
> use a very large timeout value, ideally, the slave should send a status 
> update to the scheduler right after containerizer launch fails.
>     
>     After chat with Jay, the problem you guys are refering to is the fact 
> that the scheduler cannot disinguish between the case where the task has 
> failed vs. the case where the configuration of a task is not correct, because 
> in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> 
> Jie Yu wrote:
>     Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate to propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
>     Reason field sounds good, I think what you proposed makes sense, in 
> docker containerizer at least we also need to make sure termination message 
> is set correctly as currently it doesn't contain all the error information 
> that we pass back to the launch future.
> 
> Jay Buffington wrote:
>     Sorry for the confusion.   There are three problems that are all related. 
>  Yes, we need to send statusUpdates as soon as containerizer launch fails and 
> yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request.  However, my intention with this patch is not to 
> address either of those two problems.
>     
>     My goal with this patch is to simply send the containerizer launch 
> failure message back to the scheduler.  I am using Aurora with the docker 
> containerizer.  There are a myriad of reasons that dockerd could fail to 
> "docker run" a container.  Currently, when that fails the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web ui.  
> With this patch they see the stderr output of "docker run."  This way they 
> can report meaningful error messages to the operations team.
>     
>     I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure().  Before this patch we were only 
> logging it.  The scheduler needs to get that error message.
> 
> Jie Yu wrote:
>     Thanks for clarifying it, Jay! In fact, my proposal above should be able 
> to solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
> properly set the message and reason fields in the Termination protobuf 
> (basically why the container gets terminated and what's the error message). 
> The slave will forward the reason and message to the scheduler through status 
> update.
>     
>     I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
>     
>     1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in status update 
> message. For instance, for a containerizer launch failure, we probably need 
> two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the 
> second level reason REASON_DOCKER_PULL_FAILURE;
>     
>     2. We probably want to allow extension to TaskStatus::Reason enum. For 
> example, some framework/executor may want to add customized reasons. We could 
> leverage the protobuf extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
>     
>     3. The semantics around Termination is broken currently and we need to 
> clean it up. For instance, the following code is broken,
>     ```
>     void Slave::sendExecutorTerminatedStatusUpdate(...)
>     {
>       ...
>       if (termination.isReady() && termination.get().killed()) {
>         taskState = TASK_FAILED;
>         // TODO(dhamon): MESOS-2035: Add 'reason' to 
> containerizer::Termination.
>         reason = TaskStatus::REASON_MEMORY_LIMIT;
>       }
>     }
>     ```
>     because we now have disk limit as well.
>     
>     Another issue about Termination is that containerizer->wait sometimes 
> returns a Failure (e.g., launch failed, or destroy failed), that means we 
> cannot get the reason field and message field from Termination anymore. Right 
> now, if that happens, we always set the reason to be 
> REASON_EXECUTOR_TERMINATED and the message to be "Abnormal executor 
> termination" in the status update. I think this is a hack and IMO, the 
> containerizer should always return a valid Termination protobuf to the slave 
> so that the slave can send a meaningful status update to the framework.
> 
> Jay Buffington wrote:
>     Thanks for these comments, Jie.  I'll work implementing your proposal.
>     
>     I was unware of the existance of the Termination protobuf.  I am confused 
> by these comments (which I think are just flat out wrong). 
> https://github.com/apache/mesos/blob/c36d5996327ca765f49c211d489371c99ef8e090/src/slave/slave.cpp#L3177
>  which say:
>     
>        // A termination failure indicates the containerizer could not destroy 
> a
>        // container.
>        // TODO(idownes): This is a serious error so consider aborting the 
> slave if
>        // this occurs.
>        
>     I don't understand when you would ever return Failure() in your 
> containerizer without doing container->termination.set(termination);
>     
>     Thanks!
> 
> Jie Yu wrote:
>     > I don't understand when you would ever return Failure() in your 
> containerizer without doing container->termination.set(termination);
>     
>     Yeah, I think that's a tech debt and we should correct it. Let me know if 
> you need any help for this.
> 
> Jay Buffington wrote:
>     I finally sat down to work through what you said, Jie.
>     
>     Check this out:
>     
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1194-L1202
>     
>     When pull() fails destroy() is called.  Destroy correctly sets 
> Termination, but then it deletes the container!  Then when we call 
> containerizer->wait() in slave.cpp, we get back a Failure("Unknown container: 
> ...) instead of the termination future like we should.
>     
>     IMHO the solution here isn't to split up the code from destroy() into to 
> methods: one that sets the termination (call this launchFailed()) and then 
> make destroy() just erase the container.
>     
>     Then we should only call launchFailed() from launch (here: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L636)
>  
>     
>     When launch fails we should call containerizer->destroy() from slave's 
> executorLaunched method.
> 
> Jie Yu wrote:
>     An alternative, can we simply just move 
> `containerizer->wait(containerId)` in `Slave::executorLaunched` to 
> `Framework::launchExecutor` right after `launch.onAny(defer(slave, 
> &Slave::executorLaunched, ...);`?
> 
> Jay Buffington wrote:
>     I spoke to Jie about this.  It turns out that launch and wait need to be 
> serialized to support the composing containerizer.  See 
> https://reviews.apache.org/r/23463   
>     
>     Jie questioned if destroy() really needs to be called on docker 
> containerizer launch failure.  To answer this I looked at what we do in 
> destroy if the containerizer state is <= RUNNING.  
>     
>     If the state is FETCHING destroy() will do fetcher->kill(containerId).  
> It isn't necessary to kill the fetcher if it's already failed.
>     
>     Simliarly, If state is PULLING destroy() does container->pull.discard() 
> which isn't necessary if the pull failed.
>     
>     If state is RUNNING destroy() will try to kill the executor if it has 
> started.  If "docker run" fails we know the executor isn't running.  
> destroy() goes on to attempt to run "docker stop" on the container which was 
> never created.  This results log spam regarding stopping a container that 
> doesn't exist.  
>     
>     Instead of calling destroy() on launch failure, we should call a method 
> that sets the Termination on the container with the proper message and reason 
> set.  We'll still need to erase the container object from the hashmap 
> somewhere and delete it.
>     
>     Thoughts?


It isn't necessary to kill the fetcher if it failed, but if destroy is called 
in the middle of fetch we do want to make sure the fetch is killed, otherwise 
it leaves a long running fetch process running for no reason, similiar to pull.
RUNNING is a more interesting case, since we don't know if it's really still 
running we just know it's passed fetch and pull, so at least we tried to launch 
it, and stop shouldn't cause any problems.

I think it's fine if we don't call destroy on failure, but we do have to make 
sure to call the Termination as you mentioned, and also update the state of the 
container.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On May 5, 2015, 12:43 a.m., Jay Buffington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
> 
> (Updated May 5, 2015, 12:43 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When mesos is unable to launch the containerizer the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encounted when trying to launch.
> 
> Introduces a new TaskStatus: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> -------
> 
> I added test case to slave_test.cpp.  I also tried this with Aurora, supplied 
> a bogus docker image url and saw the "docker pull" failure stderr message in 
> Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

Reply via email to