Hi Evers,

Thanks for taking this on. Vinod has agreed to shepherd this and I would be 
happy to be the initial reviewer for the patches.

-anand


> On Jun 1, 2016, at 10:27 AM, Evers Benno <[email protected]> wrote:
> 
> Some more context about this bug:
> 
> We ran some tests with a framework that does nothing but launch empty
> tasks and a sample executor that does nothing but send TASK_FINISHED
> and shut itself down.
> 
> Running on two virtual machines on the same host (i.e. no network
> involved), we see TASK_FAILED for about 3% of all tasks (271 out of
> 9000). Adding a few megabytes of data to update.data pushes this up
> to 80%. In every case I inspected manually, the logs look like this
> (IDs shortened to three characters for readability):
> 
> [...]
> I0502 14:40:33.151075 394179 slave.cpp:3002] Handling status update
> TASK_FINISHED (UUID: 20c) for task 24c of framework f20 from
> executor(1)@[2a02:6b8:0:1a16::165]:49266
> I0502 14:40:33.151175 394179 slave.cpp:3528]
> executor(1)@[2a02:6b8:0:1a16::165]:49266 exited
> I0502 14:40:33.151190 394179 slave.cpp:3886] Executor 'executor_24c' of
> framework f20 exited with status 0
> I0502 14:40:33.151216 394179 slave.cpp:3002] Handling status update
> TASK_FAILED (UUID: 01b) for task 24c of framework f20 from @0.0.0.0:0
> [...]
> 
> This failure rate is too high to ignore, so we're currently writing
> and testing a patch that waits for acknowledgements of all outstanding
> status updates before the executor shuts down.
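> 
> Roughly, the idea is bookkeeping like the following (a rough Python
> sketch only; the names are made up and it glosses over where in the
> Mesos code the change would actually live):
> 
>     import threading
>     import time
> 
>     class PendingUpdates(object):
>         """Track unacknowledged status update UUIDs so that shutdown
>         can block until all of them are acknowledged or a timeout
>         expires."""
> 
>         def __init__(self):
>             self._cond = threading.Condition()
>             self._pending = set()
> 
>         def sent(self, update_uuid):
>             # Record an update that has been sent but not acknowledged.
>             with self._cond:
>                 self._pending.add(update_uuid)
> 
>         def acknowledged(self, update_uuid):
>             # Drop the update once its acknowledgement arrives and wake
>             # up anyone blocked in wait_until_empty().
>             with self._cond:
>                 self._pending.discard(update_uuid)
>                 if not self._pending:
>                     self._cond.notify_all()
> 
>         def wait_until_empty(self, timeout):
>             # Called on shutdown: block until every update has been
>             # acknowledged, or give up after `timeout` seconds.
>             deadline = time.time() + timeout
>             with self._cond:
>                 while self._pending and time.time() < deadline:
>                     self._cond.wait(deadline - time.time())
>                 return not self._pending
> 
> The only important part is the last method: shutdown should not proceed
> while the set of unacknowledged updates is non-empty.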
> 
> It would be great if someone could shepherd this.
> 
> Best regards,
> Benno
> 
> On 03.05.2016 14:49, Evers Benno wrote:
>> Hi,
>> 
>> I was wondering about the semantics of the Executor::sendStatusUpdate()
>> method. It is described as
>> 
>>    // Sends a status update to the framework scheduler, retrying as
>>    // necessary until an acknowledgement has been received or the
>>    // executor is terminated (in which case, a TASK_LOST status update
>>    // will be sent). See Scheduler::statusUpdate for more information
>>    // about status update acknowledgements.
>> 
>> I understood this to say that the function blocks until an
>> acknowledgement is received, but looking at the implementation of
>> MesosExecutor it seems that "retrying as necessary" only means
>> re-sending unacknowledged updates when the slave reconnects
>> (exec/exec.cpp:274).
>> 
>> I'm wondering because we're currently running a python executor which
>> ends its life like this:
>> 
>>    driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>>    driver.stop()
>>    # in a different thread:
>>    sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
>> 
>> and we're seeing situations (roughly once per 10,000 tasks) where the
>> executor process is reaped before the acknowledgement for TASK_FINISHED
>> is sent from the slave to the executor. This results in Mesos generating
>> a TASK_FAILED status update, probably from
>> Slave::sendExecutorTerminatedStatusUpdate().
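>> 
>> The most obvious band-aid on our side would be to give the driver a
>> grace period before stopping it, e.g. (the value of 5 seconds is
>> arbitrary):
>> 
>>    import time
>> 
>>    driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>>    # Give the driver a chance to deliver the update and receive the
>>    # acknowledgement before the process exits.
>>    time.sleep(5)
>>    driver.stop()
>> 
>> but that only narrows the race window instead of closing it.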
>> 
>> So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
>> and do we need to change our executor shutdown sequence?
>> 
>> Best regards,
>> Benno
>> 
