Also, we plan to modify the agent to always acknowledge status updates from
the executor (MESOS-5262 <https://issues.apache.org/jira/browse/MESOS-5262>).

Once that is done, it should be sufficient for an executor to terminate itself
on receiving an acknowledgment message from the agent, instead of relying on
the best-effort hack of sleeping for some duration.
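
A rough sketch of what "terminate on acknowledgment" could look like on the
executor side, in Python since that is what Benno's executor uses. This is
only an illustration: the PendingUpdates class and its method names are
hypothetical and not part of any Mesos API, and it assumes the executor has
some way to observe acknowledgments (for example, an acknowledgment event
delivered by the agent). It tracks sent-but-unacknowledged updates and lets
the shutdown path block, with a timeout, until all of them are acknowledged.

```python
import threading


class PendingUpdates:
    """Hypothetical helper: tracks status updates awaiting acknowledgment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()
        self._drained = threading.Event()
        self._drained.set()  # nothing is pending initially

    def sent(self, update_id):
        # Call right before (or after) sending a status update.
        with self._lock:
            self._pending.add(update_id)
            self._drained.clear()

    def acknowledged(self, update_id):
        # Call from whatever handler observes the agent's acknowledgment.
        with self._lock:
            self._pending.discard(update_id)
            if not self._pending:
                self._drained.set()

    def wait_until_drained(self, timeout=None):
        # Returns True if every update was acknowledged, False on timeout.
        return self._drained.wait(timeout)
```

The executor would call sent() when it sends TASK_FINISHED, and only stop the
driver and exit once wait_until_drained() returns True. The timeout is kept as
a fallback so the executor cannot hang forever if the agent goes away.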

-anand

> On May 3, 2016, at 6:37 AM, Alex Rukletsov <[email protected]> wrote:
> 
> Benno—
> 
> you may be seeing MESOS-4111
> <https://issues.apache.org/jira/browse/MESOS-4111>. Also, have a look at
> this comment:
> https://github.com/apache/mesos/blob/9f472b1eff904d0d96063d3bed535a8e81263d69/src/launcher/executor.cpp#L611-L617
> 
> On Tue, May 3, 2016 at 2:49 PM, Evers Benno <[email protected]> wrote:
> 
>> Hi,
>> 
>> I was wondering about the semantics of the Executor::sendStatusUpdate()
>> method. It is described as
>> 
>>    // Sends a status update to the framework scheduler, retrying as
>>    // necessary until an acknowledgement has been received or the
>>    // executor is terminated (in which case, a TASK_LOST status update
>>    // will be sent). See Scheduler::statusUpdate for more information
>>    // about status update acknowledgements.
>> 
>> I understood this to mean that the function blocks until an
>> acknowledgement is received, but looking at the implementation of
>> MesosExecutor it seems that "retrying as necessary" only means
>> re-sending of unacknowledged updates when the slave reconnects.
>> (exec/exec.cpp:274)
>> 
>> I'm wondering because we're currently running a python executor which
>> ends its life like this:
>> 
>>    driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>>    driver.stop()
>>    # in a different thread:
>>    sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
>> 
>> and we're seeing situations (roughly once per 10,000 tasks) where the
>> executor process is reaped before the acknowledgement for TASK_FINISHED
>> is sent from slave to executor. This results in Mesos generating a
>> TASK_FAILED status update, probably from
>> Slave::sendExecutorTerminatedStatusUpdate().
>> 
>> So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
>> and we have to change the executor shutdown?
>> 
>> Best regards,
>> Benno
>> 
