> On May 3, 2013, 7:37 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 1085-1089
> > <https://reviews.apache.org/r/10932/diff/1/?file=287641#file287641line1085>
> >
> > This seems like the wrong thing to do. An executor can run more than
> > one task. Why do you want to kill the executor if it could get more tasks?
>
> Brenden Matthews wrote:
>     I think this is a bug.
>
>     I've had many cases where the executor launches, starts a task, and the
>     task is killed before it has finished launching. This results in the task
>     continuing to run indefinitely or until the mesos slave process is restarted.
>
> Vinod Kone wrote:
>     I don't think I follow the sequence of events here. Is it as follows?
>
>     --> Slave launches an executor
>     --> Before the executor registers with the slave, the framework asks to
>         kill a task (do you know why?)
>     --> When the executor registers it doesn't get any task from the slave
>     --> The executor is running without any task.
>
>     I don't understand what you mean by "task continuing to run
>     indefinitely". Do you mean "the executor runs indefinitely"? If it's the
>     latter, it seems the right semantics for a general purpose executor. Am I
>     missing something?
>
> Brenden Matthews wrote:
>     I'm sorry, I realize that wasn't very clear. I went digging for logs but
>     I can't find an example (they seem to have all been rotated out).
>
>     And yes, that sounds correct.
>
>     I'm not actually sure what the cause of this is. The Hadoop scheduler
>     will occasionally kill tasks, so it could be that (but I haven't scoured the
>     logs to determine the cause).
>
> Ben Mahler wrote:
>     I agree with Brenden here that this is unexpected. Currently, all
>     executors have to handle the case where they start and _never_ receive a
>     launchTask. That seems broken to me, since the expectation is that we've
>     launched the executor in order to launch a task in the first place.
>
>     After talking with Vinod I think there are three ways to "fix" this:
>
>     1) In the ExecutorDriver, create a timeout to commit suicide when no
>        launch task is received within, say, 10 seconds of registering with the slave.
>
>     2) Send the launch task to the executor anyway, immediately followed by
>        the kill task request. This is tricky.
>
>     3) Leave as is and have the MesosExecutor for Hadoop commit suicide if no
>        task is received within 10 seconds of registration. Again, this only fixes
>        the issue for the Hadoop executor.
>
>     So 1 seems to be the best option here. Other thoughts Vinod? Brenden, do
>     you want to take that on, or file an issue?
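
For reference, option 1 could be sketched roughly as below. This is a minimal, standalone illustration and not the patch under review: `LaunchWatchdog`, `registered()`, `taskLaunched()` and the `onTimeout` callback are hypothetical names, and it uses plain `std::thread` rather than whatever mechanism the ExecutorDriver would actually use. It only shows the idea of arming a timer at registration and cancelling it on the first launch task.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not part of the Mesos API): fires `onTimeout` if no
// task launch is reported within `timeout` of registration.
class LaunchWatchdog {
public:
  LaunchWatchdog(std::chrono::milliseconds timeout,
                 std::function<void()> onTimeout)
    : timeout_(timeout), onTimeout_(std::move(onTimeout)) {}

  // Destroying the watchdog cancels a pending timeout and joins the thread.
  ~LaunchWatchdog() {
    disarm();
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  // Call once the executor has registered with the slave; starts the timer.
  void registered() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (started_) {
      return;  // Arm the timer only once.
    }
    started_ = true;
    thread_ = std::thread([this] { run(); });
  }

  // Call when the first launch task arrives; cancels the timer.
  void taskLaunched() { disarm(); }

private:
  void disarm() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      disarmed_ = true;
    }
    cv_.notify_all();
  }

  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    // wait_for returns false if the timeout elapsed while still armed.
    bool disarmed = cv_.wait_for(lock, timeout_, [this] { return disarmed_; });
    lock.unlock();
    if (!disarmed) {
      onTimeout_();  // Assumption: this would shut the executor down.
    }
  }

  const std::chrono::milliseconds timeout_;
  const std::function<void()> onTimeout_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::thread thread_;
  bool started_ = false;
  bool disarmed_ = false;
};
```

An executor using this sketch would construct the watchdog with the 10 second timeout mentioned above, call `registered()` from its registration callback and `taskLaunched()` from its launch callback. Putting the timer in the driver rather than in each executor is what distinguishes option 1 from option 3.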

I wrote a quick patch, if I understand your proposal correctly.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10932/#review20136
-----------------------------------------------------------


On May 6, 2013, 7:03 p.m., Brenden Matthews wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10932/
> -----------------------------------------------------------
>
> (Updated May 6, 2013, 7:03 p.m.)
>
>
> Review request for mesos.
>
>
> Description
> -------
>
> From 607072595b91993e2d47251ee841fb3dc5d84e05 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <[email protected]>
> Date: Fri, 3 May 2013 09:47:22 -0700
> Subject: [PATCH 8/9] Terminate executors that aren't needed.
>
> If we launch an executor and then kill the task immediately after, make
> sure we also terminate the executor when there are no other tasks.
> ---
>  src/slave/slave.cpp | 48 +++++++++++++++++++++++++++---------------------
>  1 file changed, 27 insertions(+), 21 deletions(-)
>
>
> Diffs
> -----
>
>   include/mesos/executor.hpp 9b25834
>   src/exec/exec.cpp 1f022ca
>
> Diff: https://reviews.apache.org/r/10932/diff/
>
>
> Testing
> -------
>
> Used in production at airbnb.
>
>
> Thanks,
>
> Brenden Matthews
>
>
