Hi,
I was wondering about the semantics of the
ExecutorDriver::sendStatusUpdate() method. It is described as:
// Sends a status update to the framework scheduler, retrying as
// necessary until an acknowledgement has been received or the
// executor is terminated (in which case, a TASK_LOST status update
// will be sent). See Scheduler::statusUpdate for more information
// about status update acknowledgements.
I understood this to mean that the function blocks until an
acknowledgement is received, but looking at the implementation of
MesosExecutor, it seems that "retrying as necessary" only means
re-sending unacknowledged updates when the slave reconnects
(exec/exec.cpp:274).
I'm asking because we're currently running a Python executor which
ends its life like this:
driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
driver.stop()
# in a different thread:
sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
and we're seeing situations (roughly once per 10,000 tasks) where the
executor process is reaped before the acknowledgement for TASK_FINISHED
is sent from the slave to the executor. This results in Mesos generating
a TASK_FAILED status update, probably from
Slave::sendExecutorTerminatedStatusUpdate().
So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
and do we have to change how the executor shuts down?
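If it is a race, one stopgap would be to pause briefly between the two
calls. Below is a minimal sketch, assuming a fixed grace period is
acceptable. The names finish_and_stop and grace_seconds are hypothetical,
and the delay only narrows the window; it does not wait for the actual
acknowledgement.

```python
import time


def finish_and_stop(driver, status, grace_seconds=1.0, sleep=time.sleep):
    """Send a terminal status update, wait briefly, then stop the driver.

    grace_seconds is a hypothetical tuning knob (an assumption, not a
    Mesos setting): it makes the race less likely but cannot rule it out.
    """
    driver.sendStatusUpdate(status)
    # Give the slave a moment to receive and acknowledge the update
    # before the driver, and then the process, goes away.
    sleep(grace_seconds)
    driver.stop()
```

In our executor this would replace the two calls shown above, e.g.
finish_and_stop(driver, _create_task_status(TASK_FINISHED)).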
Best regards,
Benno