[jira] [Commented] (MESOS-2198) Scheduler#statusUpdate should not be called multiple times for the same status update

Adam B (JIRA) Tue, 06 Jan 2015 00:39:49 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265854#comment-14265854
 ]


Adam B commented on MESOS-2198:
-------------------------------

[~rlacroix], were you satisfied with the above explanation? If so, we can 
resolve this as Won't Fix.
Or would you still like the SchedulerDriver to cache recent status updates 
(how many?) and try to deduplicate them? What happens if the Scheduler fails 
over to another node? We would have to reliably persist the status update 
cache so that the new scheduler instance does not fall into the same trap. 
That could get slow, expensive, and complicated.
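
To make the trade-off concrete, here is a minimal sketch of what such a 
cache might look like: a bounded, in-memory set of recently seen status 
update UUIDs. The class name, method name, and capacity are hypothetical and 
are not part of libmesos or the SchedulerDriver.

```python
from collections import OrderedDict


class RecentUpdateCache:
    """Remembers the most recent status update UUIDs (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity      # the "how many?" question from above
        self._seen = OrderedDict()    # UUID -> True, oldest entry first

    def is_duplicate(self, update_uuid):
        """Return True if the UUID was seen recently; otherwise record it."""
        if update_uuid in self._seen:
            self._seen.move_to_end(update_uuid)
            return True
        self._seen[update_uuid] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest UUID
        return False
```

A retransmission is only caught while its UUID is still in the cache, and a 
scheduler instance that takes over after a failover starts with an empty 
cache unless the contents are persisted somewhere.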

The recommended solution is for the framework to use a new Mesos taskID for 
each attempt, for example by appending "attemptN" to a base taskID. The 
scheduler can then distinguish status updates from different attempts and 
still recognize that each attempt belongs to the same base taskID.
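
A rough sketch of that idea on the framework side, assuming a ".attemptN" 
suffix. The separator, the function names, and the AttemptTracker class are 
made up for illustration; Mesos itself treats the taskID as an opaque string.

```python
SEPARATOR = ".attempt"  # hypothetical naming convention, not defined by Mesos


def make_task_id(base_id, attempt):
    """Build a per-attempt taskID such as 'job-42.attempt2'."""
    return f"{base_id}{SEPARATOR}{attempt}"


def parse_task_id(task_id):
    """Split a per-attempt taskID into (base_id, attempt).

    Raises ValueError if the ID does not end in a numeric attempt suffix.
    """
    base_id, _, attempt = task_id.rpartition(SEPARATOR)
    return base_id, int(attempt)


class AttemptTracker:
    """Tracks the latest attempt launched for each base taskID."""

    def __init__(self):
        self._current = {}  # base_id -> latest attempt number

    def next_task_id(self, base_id):
        """Return the taskID to use for the next launch of base_id."""
        attempt = self._current.get(base_id, 0) + 1
        self._current[base_id] = attempt
        return make_task_id(base_id, attempt)

    def is_stale(self, task_id):
        """True if the update refers to an attempt older than the latest."""
        base_id, attempt = parse_task_id(task_id)
        return attempt < self._current.get(base_id, 0)
```

In statusUpdate, the scheduler could call is_stale() on the taskID of the 
incoming update and skip rescheduling when a retransmitted TASK_FAILED refers 
to an earlier attempt.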

> Scheduler#statusUpdate should not be called multiple times for the same 
> status update
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-2198
>                 URL: https://issues.apache.org/jira/browse/MESOS-2198
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Robert Lacroix
>
> Currently Scheduler#statusUpdate can be called multiple times for the same 
> status update, for example when the slave retransmits a status update because 
> it was not acknowledged in time. Especially for terminal status updates, this 
> can lead to unexpected scheduler behavior when task IDs are reused.
> Consider this scenario:
> * Scheduler schedules task
> * Task fails, slave sends TASK_FAILED
> * Scheduler is busy and libmesos doesn't acknowledge update in time
> * Slave retransmits TASK_FAILED
> * Scheduler eventually receives first TASK_FAILED and reschedules task
> * Second TASK_FAILED triggers statusUpdate again and the scheduler can't 
> determine if the TASK_FAILED belongs to the first or second run of the task.
> It would be a lot better if libmesos deduplicated status updates and called 
> Scheduler#statusUpdate only once per status update it received. Retries with 
> the same UUID shouldn't cause Scheduler#statusUpdate to be executed again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
