> On May 3, 2013, 7:37 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 1085-1089
> > <https://reviews.apache.org/r/10932/diff/1/?file=287641#file287641line1085>
> >
> > this seems like the wrong thing to do. an executor can run more than
> > one task. why do you want to kill the executor if it could get more tasks?
> 
> Brenden Matthews wrote:
>     I think this is a bug.
>     
>     I've had many cases where the executor launches, starts a task, and the
>     task is killed before it has finished launching. This results in the task
>     continuing to run indefinitely or until the mesos slave process is restarted.
> 
> Vinod Kone wrote:
>     I don't think I follow the sequence of events here. Is it as follows?
>     
>     --> Slave launches an executor
>     --> Before the executor registers with the slave, the framework asks to
>         kill a task (do you know why?)
>     --> When the executor registers it doesn't get any task from the slave
>     --> The executor is running without any task.
>     
>     I don't understand what do you mean by "task continuing to run
>     indefinitely". Do you mean "the executor runs indefinitely"? If its the
>     latter, it seems the right semantics for a general purpose executor. Am I
>     missing something?
> 
> Brenden Matthews wrote:
>     I'm sorry, I realize that wasn't very clear. I went digging for logs but
>     I can't find an example (it seems to have been all rotated out).
>     
>     And yes, that sounds correct.
>     
>     I'm not actually sure what the cause of this is. The Hadoop scheduler
>     will occasionally kill tasks, so it could be that (but I haven't scoured the
>     logs to determine the cause).

I agree with Brenden here that this is unexpected. Currently, all executors have to handle the case where they start and _never_ receive a launchTask. That seems broken to me, since the expectation is that we've launched the executor in order to launch a task in the first place.

After talking with Vinod I think there are three ways to "fix" this:

1) In the ExecutorDriver, create a timeout to commit suicide when no launch task is received within, say, 10 seconds of registering with the slave.
2) Send the launch task to the executor anyway, immediately followed by the kill task request. This is tricky.
3) Leave as is and have the MesosExecutor for Hadoop commit suicide if no task is received within 10 seconds of registration. This only fixes the issue for the Hadoop executor.

So 1 seems to be the best option here. Other thoughts Vinod? Brenden, do you want to take that on, or file an issue?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10932/#review20136
-----------------------------------------------------------


On May 3, 2013, 6:42 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10932/
> -----------------------------------------------------------
> 
> (Updated May 3, 2013, 6:42 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Description
> -------
> 
> From 607072595b91993e2d47251ee841fb3dc5d84e05 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <[email protected]>
> Date: Fri, 3 May 2013 09:47:22 -0700
> Subject: [PATCH 8/9] Terminate executors that aren't needed.
> 
> If we launch an executor and then kill the task immediately after, make
> sure we also terminate the executor when there are no other tasks.
> ---
>  src/slave/slave.cpp | 48 +++++++++++++++++++++++++++---------------------
>  1 file changed, 27 insertions(+), 21 deletions(-)
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 86a15fc 
> 
> Diff: https://reviews.apache.org/r/10932/diff/
> 
> 
> Testing
> -------
> 
> Used in production at airbnb.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>
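
For reference, the registration timeout proposed as option 1 in the reply above could be sketched roughly as follows. This is not Mesos code and does not use the ExecutorDriver API: the class name LaunchWatchdog, its registered() and cancel() methods, and the onTimeout callback are all hypothetical names chosen for illustration, and a plain std::thread stands in for whatever timer mechanism the driver would really use. The idea is only that registering arms a timer, the first launchTask disarms it, and the callback (which in an executor would presumably terminate the process) runs if nothing arrives in time.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper, not part of the Mesos API: runs `onTimeout` if
// cancel() is not called within `timeout` of registered().
class LaunchWatchdog {
public:
  LaunchWatchdog(std::chrono::milliseconds timeout,
                 std::function<void()> onTimeout)
    : timeout_(timeout), onTimeout_(std::move(onTimeout)) {}

  ~LaunchWatchdog() {
    cancel();
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  // Call once the executor has registered with the slave.
  void registered() {
    if (thread_.joinable()) {
      return;  // Already armed; ignore repeated registrations.
    }
    thread_ = std::thread([this] {
      std::unique_lock<std::mutex> lock(mutex_);
      // wait_for returns false when the timeout elapsed and the
      // predicate is still false, i.e. no task was ever launched.
      bool cancelled =
          cv_.wait_for(lock, timeout_, [this] { return cancelled_; });
      if (!cancelled) {
        lock.unlock();
        onTimeout_();
      }
    });
  }

  // Call when the first launchTask arrives (or on normal shutdown).
  void cancel() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      cancelled_ = true;
    }
    cv_.notify_all();
  }

private:
  std::chrono::milliseconds timeout_;
  std::function<void()> onTimeout_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool cancelled_ = false;
  std::thread thread_;
};
```

With the 10 second value suggested above, a driver would construct the watchdog with std::chrono::seconds(10), call registered() from its registration handler, and call cancel() from its launchTask handler. Doing this in the driver rather than in each executor is what makes option 1 cover every framework, not just Hadoop.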
