Alex,

Thanks, that put me on the right track.

It seems the executor driver indeed does not wait for
acknowledgements before stopping. As Yan Xu and Vinod Kone observed
in MESOS-243 <https://issues.apache.org/jira/browse/MESOS-243>:

> The right fix for this for stop() to wait for ACKs. We will do this
> once acks are plumbed through to the executor.

Without that change, it seems that fixing MESOS-4111 alone wouldn't
prevent the problem, since (afaict) the race is between waitpid() and
read() on the slave.
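
Until stop() waits for ACKs, one stopgap on the executor side might be
to pause briefly between sending the terminal update and stopping the
driver. The following is only a sketch: the helper name
finish_and_stop and the grace_seconds default are hypothetical, and
the delay narrows the window without closing the race, since no
acknowledgement is actually awaited.

```python
import time


def finish_and_stop(driver, status, grace_seconds=1.0, sleep=time.sleep):
    """Send a terminal status update, wait a fixed grace period, then stop.

    `driver` is any object exposing sendStatusUpdate() and stop().
    The helper name and the grace_seconds default are hypothetical; the
    delay only narrows the race, because no acknowledgement is awaited.
    """
    driver.sendStatusUpdate(status)
    # Give the update time to reach the slave before the driver shuts down.
    sleep(grace_seconds)
    driver.stop()
```

In our executor this would replace the back-to-back sendStatusUpdate()
and stop() calls; the thread calling driver.run() stays unchanged.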

Best regards,
Benno

On 03.05.2016 15:37, Alex Rukletsov wrote:
> Benno—
> 
> you may be seeing MESOS-4111 
> <https://issues.apache.org/jira/browse/MESOS-4111>. Also, have a
> look at this comment: 
> https://github.com/apache/mesos/blob/9f472b1eff904d0d96063d3bed535a8e81263d69/src/launcher/executor.cpp#L611-L617
>
>
> 
> On Tue, May 3, 2016 at 2:49 PM, Evers Benno <[email protected]>
> wrote:
> 
>> Hi,
>> 
>> I was wondering about the semantics of the 
>> Executor::sendStatusUpdate() method. It is described as
>> 
>> // Sends a status update to the framework scheduler, retrying as
>> // necessary until an acknowledgement has been received or the
>> // executor is terminated (in which case, a TASK_LOST status update
>> // will be sent). See Scheduler::statusUpdate for more information
>> // about status update acknowledgements.
>> 
>> I understood this to mean that the function blocks until an
>> acknowledgement is received, but looking at the implementation of
>> MesosExecutor, it seems that "retrying as necessary" only means
>> re-sending unacknowledged updates when the slave reconnects
>> (exec/exec.cpp:274).
>> 
>> I'm wondering because we're currently running a Python executor
>> which ends its life like this:
>> 
>> driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>> driver.stop()
>> # in a different thread:
>> sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
>> 
>> and we're seeing situations (roughly once per 10,000 tasks) where 
>> the executor process is reaped before the acknowledgement for 
>> TASK_FINISHED is sent from slave to executor. This results in
>> Mesos generating a TASK_FAILED status update, probably from
>> Slave::sendExecutorTerminatedStatusUpdate().
>> 
>> So, did I misunderstand how MesosExecutor works? Or is it indeed a 
>> race, and we have to change the executor shutdown?
>> 
>> Best regards, Benno
>> 
> 
