> On May 3, 2013, 7:37 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 1085-1089
> > <https://reviews.apache.org/r/10932/diff/1/?file=287641#file287641line1085>
> >
> > This seems like the wrong thing to do. An executor can run more than
> > one task. Why do you want to kill the executor if it could get more tasks?
>
> Brenden Matthews wrote:
> I think this is a bug.
>
> I've had many cases where the executor launches, starts a task, and the
> task is killed before it has finished launching. This results in the task
> continuing to run indefinitely or until the mesos slave process is restarted.
>
> Vinod Kone wrote:
> I don't think I follow the sequence of events here. Is it as follows?
>
> --> Slave launches an executor
> --> Before the executor registers with the slave, the framework asks to
> kill a task (do you know why?)
> --> When the executor registers it doesn't get any task from the slave
> --> The executor is running without any task.
>
> I don't understand what you mean by "task continuing to run
> indefinitely". Do you mean "the executor runs indefinitely"? If it's the
> latter, that seems like the right semantics for a general-purpose executor.
> Am I missing something?
>
> Brenden Matthews wrote:
> I'm sorry, I realize that wasn't very clear. I went digging for logs but
> I can't find an example (they seem to have all been rotated out).
>
> And yes, that sounds correct.
>
> I'm not actually sure what the cause of this is. The Hadoop scheduler
> will occasionally kill tasks, so it could be that (but I haven't scoured the
> logs to determine the cause).
>
> Ben Mahler wrote:
> I agree with Brenden here that this is unexpected. Currently, all
> executors have to handle the case where they start and _never_ receive a
> launchTask. That seems broken to me, since the expectation is that we've
> launched the executor in order to launch a task in the first place.
>
> After talking with Vinod I think there are three ways to "fix" this:
>
> 1) In the ExecutorDriver, create a timeout to commit suicide when no
> launch task is received within, say, 10 seconds of registering with the slave.
>
> 2) Send the launch task to the executor anyway, immediately followed by
> the kill task request. This is tricky.
>
> 3) Leave as is and have the MesosExecutor for Hadoop commit suicide if no
> task is received within 10 seconds of registration. Again, this only fixes
> the issue for the Hadoop executor.
>
> So 1 seems to be the best option here. Other thoughts, Vinod? Brenden, do
> you want to take that on, or file an issue?
>
> Brenden Matthews wrote:
> I wrote a quick patch, if I understand your proposal correctly.
>
> Vinod Kone wrote:
> Sorry for the back and forth, Brenden. After talking with Ben Mahler and
> looking at your diffs, I would actually prefer your first solution because
> it's very simple.
>
> I didn't realize that you were only killing the executor if it hasn't
> registered yet. Since we only launch an executor if there is a task for it, I
> think it's actually fine to send a kill when queued and launched tasks are
> empty and the executor hasn't registered yet.
>
> That said, we need some re-formatting on your 1st diff :). Our formatting
> rule is to put all args on the second line if that fits within 80 characters,
> otherwise each arg on a different line.
>
>     dispatch(
>         isolator, &Isolator::killExecutor, framework->id, executor->id);
>
> @benh do you have any comments?
>
>
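The condition Vinod describes above (kill the executor only when it has not registered yet and has no queued or launched tasks) could be modeled roughly as follows. This is a toy Python sketch, not the slave.cpp code: `ExecutorState`, `should_kill_executor`, and `kill_task` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorState:
    """Hypothetical stand-in for the slave's per-executor bookkeeping."""
    registered: bool = False
    queued_tasks: set = field(default_factory=set)
    launched_tasks: set = field(default_factory=set)


def should_kill_executor(executor: ExecutorState) -> bool:
    """True only for an unregistered executor with no work left."""
    return (not executor.registered
            and not executor.queued_tasks
            and not executor.launched_tasks)


def kill_task(executor: ExecutorState, task_id: str) -> bool:
    """Drop a task and report whether the executor should be killed too."""
    executor.queued_tasks.discard(task_id)
    executor.launched_tasks.discard(task_id)
    return should_kill_executor(executor)
```

Under this model a registered executor is never killed here, which keeps the general-purpose "executor may receive more tasks later" semantics intact.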
#1 is precisely what Twitter's executor does:
in executor's main():

    # Create driver stub
    driver = mesos.MesosExecutorDriver(thermos_executor)

    # This is an ephemeral executor -- shutdown if we receive no tasks within a
    # certain time period
    ThermosExecutorTimer(thermos_executor, driver).start()

    # Start executor.
    driver.run()

ThermosExecutorTimer:

    class ThermosExecutorTimer(ExceptionalThread):
        EXECUTOR_TIMEOUT = Amount(10, Time.SECONDS)

        def __init__(self, executor, driver):
            self._executor = executor
            self._driver = driver
            super(ThermosExecutorTimer, self).__init__()
            self.daemon = True

        def run(self):
            self._executor.launched.wait(self.EXECUTOR_TIMEOUT.as_(Time.SECONDS))
            if not self._executor.launched.is_set():
                self._executor.log('Executor timing out on lack of launchTask.')
                self._driver.stop()

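
The timer quoted above depends on helpers (ExceptionalThread, Amount, Time) that aren't shown here. For anyone who wants to try the same idea in isolation, a rough standard-library-only sketch might look like this. `LaunchTimeout` and its `stop_driver` callback are hypothetical names, and the 10-second default simply mirrors the value in the snippet above.

```python
import threading


class LaunchTimeout(threading.Thread):
    """Stops the driver if no task is launched within `timeout` seconds."""

    def __init__(self, launched, stop_driver, timeout=10.0):
        super().__init__(daemon=True)
        self._launched = launched        # threading.Event, set on launchTask
        self._stop_driver = stop_driver  # callable, e.g. driver.stop
        self._timeout = timeout
        self.timed_out = False

    def run(self):
        # Event.wait returns False if the timeout elapsed without set().
        if not self._launched.wait(self._timeout):
            self.timed_out = True
            self._stop_driver()
```

The executor's launchTask handler would call `launched.set()`, and the thread would be started just before the driver runs, as in the main() snippet above.
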
- Brian
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10932/#review20136
-----------------------------------------------------------
On May 6, 2013, 7:07 p.m., Brenden Matthews wrote:
>
> (Updated May 6, 2013, 7:07 p.m.)
>
>
> Review request for mesos.
>
>
> Description
> -------
>
> From 607072595b91993e2d47251ee841fb3dc5d84e05 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <[email protected]>
> Date: Fri, 3 May 2013 09:47:22 -0700
> Subject: [PATCH 8/9] Terminate executors that aren't needed.
>
> If we launch an executor and then kill the task immediately after, make
> sure we also terminate the executor when there are no other tasks.
> ---
> src/slave/slave.cpp | 48 +++++++++++++++++++++++++++---------------------
> 1 file changed, 27 insertions(+), 21 deletions(-)
>
>
> Diffs
> -----
>
> include/mesos/executor.hpp 9b25834
> src/exec/exec.cpp 1f022ca
>
> Diff: https://reviews.apache.org/r/10932/diff/
>
>
> Testing
> -------
>
> Used in production at airbnb.
>
>
> Thanks,
>
> Brenden Matthews
>
>