Re: Review Request 21424: Fixed orphaned container handling in the ExternalContainerizer recover implementation.

Till Toenshoff Sun, 25 May 2014 13:52:26 -0700


> On May 24, 2014, 9:09 p.m., Niklas Nielsen wrote:
> > Modulo a clarifying comment.
> 
> Niklas Nielsen wrote:
>     Just took this patch for a spin and I am running into problems with EC 
> recovery during startup. Would you mind reaching out before committing?
> 
> Till Toenshoff wrote:
>     Thanks for doing that double check. I am now testing it again and also 
> trying to reach you on the usual channels...

I am pretty sure you did not apply 21677 before testing this RR. That RR is 
mandatory to get any kind of slave recovery to work with the EC. Maybe I 
should have supplied it as a dependency - sorry for that.

Was the problem you saw related to the following output of the slave while 
recovering?
"Failed to get forked pid for executor XXXX of framework XXXX"

> On May 24, 2014, 9:09 p.m., Niklas Nielsen wrote:
> > src/slave/containerizer/external_containerizer.cpp, line 409
> > <https://reviews.apache.org/r/21424/diff/4/?file=592590#file592590line409>
> >
> >     This works (ensures task_lost updates are sent) because wait() returns 
> > immediately in the slave and cause executorTerminated to be called? If so, 
> > maybe worth a comment :)

No, not exactly.
The slave uses executorTerminated as a continuation of its own calls to 
containerizer->wait (slave.cpp:2920 for 'Slave::recover' and slave.cpp:3376 for 
'Framework::launchExecutor'). During the EC's orphan cleanup, however, the slave 
does not initiate that 'wait'; the EC itself does 
(external_containerizer.cpp:402). The EC invokes 'wait' only to get a 
confirmation that the orphan destruction is done, so that it can delay 
satisfying the containerizer 'recover' future until then. So that specific 
'destroy' triggers no callback to the slave's executorTerminated.
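To illustrate the point, here is a minimal sketch in plain C++ of why only the 
party that called 'wait' gets notified by a 'destroy'. ToyContainerizer and 
recoverOrphans are hypothetical names made up for this example; this is not 
the ExternalContainerizer code and it uses callbacks instead of libprocess 
futures.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy containerizer, NOT Mesos code.
class ToyContainerizer
{
public:
  // Registers a callback that fires once the given container terminates.
  void wait(const std::string& id, std::function<void()> onTerminated)
  {
    waiters[id].push_back(std::move(onTerminated));
  }

  // Terminates the container and notifies only the waiters registered
  // for that container.
  void destroy(const std::string& id)
  {
    std::vector<std::function<void()>> callbacks = std::move(waiters[id]);
    waiters.erase(id);
    for (std::function<void()>& callback : callbacks) {
      callback();
    }
  }

  // Orphan cleanup: the containerizer itself waits on each orphan, then
  // destroys it. Returns the number of confirmed destructions.
  int recoverOrphans(const std::vector<std::string>& orphans)
  {
    int confirmed = 0;
    for (const std::string& id : orphans) {
      wait(id, [&confirmed]() { ++confirmed; });
      destroy(id);
    }
    return confirmed;
  }

private:
  std::map<std::string, std::vector<std::function<void()>>> waiters;
};
```

A callback that the "slave" registered for a container it knows about stays 
untouched while the orphans are destroyed; it only fires when that known 
container is destroyed.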

Still, the question remains: is there any status feedback (TASK_LOST updates) 
for those orphans?
No, I think all we get is a TASK_KILLED via the StatusUpdateMessage sent from 
the Executor(Driver) once the command got reaped (), which should follow a 
destroy.

> On May 24, 2014, 9:09 p.m., Niklas Nielsen wrote:
> > src/slave/containerizer/external_containerizer.cpp, lines 423-427
> > <https://reviews.apache.org/r/21424/diff/4/?file=592590#file592590line423>
> >
> >     collect() returns Future<Nothing> and when we don't do anything with 
> > the future in the continuation (other than a log message), how about 
> > flattening it?
> 
> Till Toenshoff wrote:
>     Ow good point indeed, thanks!

Actually: 
template <typename T>
Future<std::list<T> > collect(const std::list<Future<T> >& futures);

So collect returns a future of a list of containers, or did I miss something?
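
To make the type question concrete, here is a minimal sketch with toy 
stand-ins. Future and Nothing below are simplified types written for this 
example (a Future here is always already satisfied), not the libprocess ones, 
and discard is a hypothetical helper that shows what flattening the result to 
a Future<Nothing> could look like.

```cpp
#include <cassert>
#include <list>

// Toy stand-ins, NOT the libprocess types: a Future here is always
// already satisfied and simply holds its value.
struct Nothing {};

template <typename T>
struct Future
{
  T value;
  const T& get() const { return value; }
};

// Same shape as the declaration quoted above: a list of futures goes in,
// one future holding a list of values comes out.
template <typename T>
Future<std::list<T> > collect(const std::list<Future<T> >& futures)
{
  Future<std::list<T> > result;
  for (const Future<T>& future : futures) {
    result.value.push_back(future.get());
  }
  return result;
}

// Hypothetical helper: drops the collected list so that a caller only
// learns that everything completed.
template <typename T>
Future<Nothing> discard(const Future<std::list<T> >& collected)
{
  (void) collected; // The values are intentionally ignored.
  return Future<Nothing>();
}
```

With these definitions, collect on a std::list<Future<int> > yields a 
Future<std::list<int> >, and only the extra discard step turns that into a 
Future<Nothing>.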

- Till

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21424/#review43901
-----------------------------------------------------------

On May 24, 2014, 12:14 a.m., Till Toenshoff wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/21424/
> -----------------------------------------------------------
> 
> (Updated May 24, 2014, 12:14 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1364
>     https://issues.apache.org/jira/browse/MESOS-1364
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> An orphaned container is known to the ECP but not to the EC, thus not 
> recoverable but pending. This patch enforces a call to destroy for any orphan 
> that has been identified as such during the recovery phase.
> 
> NOTE: Such destroy is wrapped by a call to "wait" to make sure the 
> ExternalContainerizer gets to know when a container was destroyed 
> successfully.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/external_containerizer.hpp 7e5474c 
>   src/slave/containerizer/external_containerizer.cpp ac3dd18 
> 
> Diff: https://reviews.apache.org/r/21424/diff/
> 
> 
> Testing
> -------
> 
> make check against upcoming SlaveRecoveryTests (enabled locally)
> 
> 
> Thanks,
> 
> Till Toenshoff
> 
>
