> On May 24, 2014, 9:09 p.m., Niklas Nielsen wrote:
> > Modulo a clarifying comment.
>
> Niklas Nielsen wrote:
>     Just took this patch for a spin and I am running into problems with EC
>     recovery during startup. Would you mind reaching out before committing?
>
> Till Toenshoff wrote:
>     Thanks for doing that double check. I am now testing it again and also
>     trying to reach you on the usual channels...

I am pretty sure you did not apply 21677 before testing this RR. The former RR is mandatory to get any kind of slave recovery to work with the EC. Maybe I should have supplied it as a dependency, sorry for that. Was the problem you saw related to the following output of the slave while recovering?
"Failed to get forked pid for executor XXXX of framework XXXX"


> On May 24, 2014, 9:09 p.m., Niklas Nielsen wrote:
> > src/slave/containerizer/external_containerizer.cpp, line 409
> > <https://reviews.apache.org/r/21424/diff/4/?file=592590#file592590line409>
> >
> >     This works (ensures task_lost updates are sent) because wait() returns
> >     immediately in the slave and cause executorTerminated to be called? If so,
> >     maybe worth a comment :)

No, it is not exactly that way. The slave uses executorTerminated as a continuation of its own calls to containerizer->wait (slave.cpp:2920 for 'Slave::recover' and slave.cpp:3376 for 'Framework::launchExecutor'). During the EC's orphan cleanup, however, the slave does not initiate that 'wait'; the EC itself does (external_containerizer.cpp:402). The EC invokes 'wait' only to get a confirmation once the orphan destruction is done, so that it can delay satisfying the containerizer's 'recover' future until then. So there is no callback to the slave's executorTerminated as a result of that specific 'destroy'.

Still, the question remains whether there is any status feedback (TASK_LOST updates) for those orphans. No, I think all we get is a TASK_KILLED via the StatusUpdateMessage sent from the Executor(Driver) once the command got reaped (), which should follow a destroy.


> On May 24, 2014, 9:09 p.m., Niklas Nielsen wrote:
> > src/slave/containerizer/external_containerizer.cpp, lines 423-427
> > <https://reviews.apache.org/r/21424/diff/4/?file=592590#file592590line423>
> >
> >     collect() returns Future<Nothing> and when we don't do anything with
> >     the future in the continuation (other than a log message), how about
> >     flattening it?
>
> Till Toenshoff wrote:
>     Ow good point indeed, thanks!

Actually:
template <typename T>
Future<std::list<T> > collect(const std::list<Future<T> >& futures);

So collect returns a future of a list of containers, or did I miss something?


- Till


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21424/#review43901
-----------------------------------------------------------


On May 24, 2014, 12:14 a.m., Till Toenshoff wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/21424/
> -----------------------------------------------------------
>
> (Updated May 24, 2014, 12:14 a.m.)
>
>
> Review request for mesos and Niklas Nielsen.
>
>
> Bugs: MESOS-1364
>     https://issues.apache.org/jira/browse/MESOS-1364
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> An orphaned container is known to the ECP but not to the EC, thus not
> recoverable but pending. This patch enforces a call to destroy for any orphan
> that has been identified as such during the recovery phase.
>
> NOTE: Such destroy is wrapped by a call to "wait" to make sure the
> ExternalContainerizer gets to know when a container was destroyed
> successfully.
>
>
> Diffs
> -----
>
>   src/slave/containerizer/external_containerizer.hpp 7e5474c
>   src/slave/containerizer/external_containerizer.cpp ac3dd18
>
> Diff: https://reviews.apache.org/r/21424/diff/
>
>
> Testing
> -------
>
> make check against upcoming SlaveRecoveryTests (enabled locally)
>
>
> Thanks,
>
> Till Toenshoff
>
>
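
For illustration only, the ordering discussed in this thread (destroy every orphan, wait for each destruction to be confirmed, and only then treat recovery as finished) could be sketched roughly as below. This is not the code from the patch and it does not use libprocess; the class name OrphanCleanupSketch, its methods, and the plain callback in place of a Future are all assumptions made up for this sketch.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in, not Mesos code: tracks orphaned containers whose
// destruction has been requested and invokes `onRecovered` exactly once,
// after every one of them has been confirmed as terminated.
class OrphanCleanupSketch
{
public:
  OrphanCleanupSketch(
      std::set<std::string> orphans,
      std::function<void()> onRecovered)
    : pending(std::move(orphans)),
      done(std::move(onRecovered)),
      fired(false)
  {
    // With no orphans there is nothing to wait for.
    maybeFinish();
  }

  // Called when the "wait" on a destroyed orphan has been satisfied.
  // Unknown or already confirmed ids are ignored.
  void terminated(const std::string& containerId)
  {
    pending.erase(containerId);
    maybeFinish();
  }

  // True once all orphans have been confirmed as terminated.
  bool recovered() const { return fired; }

private:
  void maybeFinish()
  {
    if (!fired && pending.empty()) {
      fired = true;
      if (done) {
        done();
      }
    }
  }

  std::set<std::string> pending;
  std::function<void()> done;
  bool fired;
};
```

In this sketch the completion callback plays the role of satisfying the containerizer's 'recover' future: it does not run while any orphan is still pending, and a repeated confirmation for the same container does not trigger it a second time.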
