Hi Evers,

Thanks for taking this on. Vinod has agreed to shepherd this, and I would be
happy to be the initial reviewer for the patches.
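To make sure we are reviewing the same idea: below is a rough sketch of what
"wait for confirmations for all status updates on executor shutdown" could
look like on the executor library side. It is only an illustration of the
approach, not the patch itself; the class and method names are made up for
this mail.

    import threading
    import time

    class AckTracker(object):
        """Illustration only: track status update UUIDs until the agent
        acknowledges them, so shutdown can block while anything is pending."""

        def __init__(self):
            self._cond = threading.Condition()
            self._pending = set()

        def sent(self, uuid):
            # Record an update as outstanding when it is sent to the agent.
            with self._cond:
                self._pending.add(uuid)

        def acknowledged(self, uuid):
            # Remove an update once the acknowledgement comes back.
            with self._cond:
                self._pending.discard(uuid)
                self._cond.notify_all()

        def wait_until_drained(self, timeout=None):
            # Block (e.g. during executor shutdown) until every update sent
            # has been acknowledged, or the timeout expires. Returns True if
            # nothing is pending, False on timeout.
            deadline = None if timeout is None else time.time() + timeout
            with self._cond:
                while self._pending:
                    remaining = None if deadline is None else deadline - time.time()
                    if remaining is not None and remaining <= 0:
                        return False
                    self._cond.wait(remaining)
                return True

The interesting design questions for the real patch are where such a wait
hooks into the shutdown path and whether it is bounded by a timeout, so an
executor cannot hang forever on a lost acknowledgement.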
-anand

> On Jun 1, 2016, at 10:27 AM, Evers Benno <[email protected]> wrote:
>
> Some more context about this bug:
>
> We did some tests with a framework that does nothing but send empty
> tasks and sample executor that does nothing but send TASK_FINISHED and
> shut itself down.
>
> Running on two virtual machines on the same host (i.e. no network
> involved), we see TASK_FAILED in about 3% of all tasks (271 out of
> 9000). Adding some megabytes of data into update.data, this can go up
> to 80%. In all cases where I looked manually, the logs look like this:
> (id's shortened to three characters for better readability)
>
> [...]
> I0502 14:40:33.151075 394179 slave.cpp:3002] Handling status update
> TASK_FINISHED (UUID: 20c) for task 24c of framework f20 from
> executor(1)@[2a02:6b8:0:1a16::165]:49266
> I0502 14:40:33.151175 394179 slave.cpp:3528]
> executor(1)@[2a02:6b8:0:1a16::165]:49266 exited
> I0502 14:40:33.151190 394179 slave.cpp:3886] Executor 'executor_24c' of
> framework f20 exited with status 0
> I0502 14:40:33.151216 394179 slave.cpp:3002] Handling status update
> TASK_FAILED (UUID: 01b) for task 24c of framework f20 from @0.0.0.0:0
> [...]
>
> The random failure chance is a bit too high to ignore, so we're
> currently writing/testing a patch to wait for confirmations for all
> status updates on executor shutdown.
>
> It would be great if someone would like to shepherd this.
>
> Best regards,
> Benno
>
> On 03.05.2016 14:49, Evers Benno wrote:
>> Hi,
>>
>> I was wondering about the semantics of the Executor::sendStatusUpdate()
>> method. It is described as
>>
>> // Sends a status update to the framework scheduler, retrying as
>> // necessary until an acknowledgement has been received or the
>> // executor is terminated (in which case, a TASK_LOST status update
>> // will be sent). See Scheduler::statusUpdate for more information
>> // about status update acknowledgements.
>>
>> I was understanding this to say that the function blocks until an
>> acknowledgement is received, but looking at the implementation of
>> MesosExecutor it seems that "retrying as necessary" only means
>> re-sending of unacknowledged updates when the slave reconnects.
>> (exec/exec.cpp:274)
>>
>> I'm wondering because we're currently running a python executor which
>> ends its life like this:
>>
>> driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>> driver.stop()
>> # in a different thread:
>> sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
>>
>> and we're seeing situations (roughly once per 10,000 tasks) where the
>> executor process is reaped before the acknowledgement for TASK_FINISHED
>> is sent from slave to executor. This results in mesos generating a
>> TASK_FAILED status update, probably from
>> Slave::sendExecutorTerminatedStatusUpdate().
>>
>> So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
>> and we have to change the executor shutdown?
>>
>> Best regards,
>> Benno
>>
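
As an aside, until a proper fix lands, a crude mitigation on the executor
side is to delay process exit after the final update. The sketch below
assumes the same pre-1.0 Python bindings as the snippet quoted above; the
helper name and the grace period are invented for this mail, and sleeping
only makes the TASK_FAILED race less likely, it does not close it.

    import time

    def send_final_update_and_stop(driver, task_status, grace_seconds=5.0):
        # sendStatusUpdate() returns without waiting for the agent's
        # acknowledgement, so the update may still be in flight here.
        driver.sendStatusUpdate(task_status)

        # There is no driver call that blocks until the acknowledgement
        # arrives, so give the agent a moment before tearing the process
        # down. This narrows the window, nothing more.
        time.sleep(grace_seconds)

        driver.stop()

The thread that called driver.run() can then exit exactly as in the quoted
snippet.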
