[
https://issues.apache.org/jira/browse/MESOS-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265854#comment-14265854
]
Adam B commented on MESOS-2198:
-------------------------------
[~rlacroix], were you satisfied with the above explanation? If so, then we can
resolve this as Won't Fix.
Or would you still like to see the SchedulerDriver cache recent status updates
(how many?) and try to dedupe them? And what if the Scheduler fails over (to
another node)? We'd have to reliably persist the status update cache so that
the new scheduler instance doesn't fall into the same trap. That could get
slow, expensive, and complicated.
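To make the trade-off concrete, here is a minimal sketch of what a bounded, in-memory dedupe cache keyed on the status update UUID could look like. The class and method names are hypothetical and not part of any Mesos API; the sketch only assumes that each status update carries a UUID, as the issue description says.

```python
from collections import OrderedDict


class RecentUpdateCache:
    """Hypothetical bounded cache of recently seen status update UUIDs."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        # Keys are UUIDs, ordered from least to most recently seen.
        self._seen = OrderedDict()

    def is_duplicate(self, update_uuid: bytes) -> bool:
        """Return True if this UUID is still in the cache, else record it."""
        if update_uuid in self._seen:
            # Refresh recency so frequently retried updates stay cached.
            self._seen.move_to_end(update_uuid)
            return True
        self._seen[update_uuid] = None
        if len(self._seen) > self._capacity:
            # Evict the least recently seen UUID; a later retry of that
            # update would no longer be recognized as a duplicate.
            self._seen.popitem(last=False)
        return False
```

Note that the cache lives only in memory, so it is lost on scheduler failover, and any fixed capacity can evict a UUID before its last retransmission arrives. Those are the two gaps raised above.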
The recommended solution would be for the framework to change the Mesos taskID
with each attempt, perhaps by appending "attemptN" to the taskID. Then you can
distinguish between status messages from different attempts, and still
recognize that each attempt comes from the same base taskID.
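As a rough sketch of that approach, a framework could generate and parse attempt-suffixed task IDs along these lines. The "-attemptN" separator format, the helper functions, and the AttemptTracker class are all assumptions made for illustration; they are not part of the Mesos API.

```python
import re
from typing import Dict, Tuple

# Assumed naming convention: "<base task ID>-attempt<N>".
_ATTEMPT_SUFFIX = re.compile(r"^(?P<base>.+)-attempt(?P<n>\d+)$")


def make_task_id(base_id: str, attempt: int) -> str:
    """Build the task ID string for a given attempt of a base task."""
    return f"{base_id}-attempt{attempt}"


def parse_task_id(task_id: str) -> Tuple[str, int]:
    """Split an attempt-suffixed task ID into (base ID, attempt number)."""
    match = _ATTEMPT_SUFFIX.match(task_id)
    if match is None:
        raise ValueError(f"task ID has no attempt suffix: {task_id!r}")
    return match.group("base"), int(match.group("n"))


class AttemptTracker:
    """Hypothetical helper that remembers the latest attempt per base task."""

    def __init__(self) -> None:
        self._current: Dict[str, int] = {}

    def next_task_id(self, base_id: str) -> str:
        """Start a new attempt and return the task ID to launch it with."""
        attempt = self._current.get(base_id, 0) + 1
        self._current[base_id] = attempt
        return make_task_id(base_id, attempt)

    def is_current(self, task_id: str) -> bool:
        """True if a status update's task ID belongs to the latest attempt."""
        base_id, attempt = parse_task_id(task_id)
        return self._current.get(base_id) == attempt
```

With this in place, a retransmitted TASK_FAILED for the first attempt arrives with the old task ID, so the scheduler can check is_current() and ignore it instead of rescheduling the task again.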
> Scheduler#statusUpdate should not be called multiple times for the same
> status update
> -------------------------------------------------------------------------------------
>
> Key: MESOS-2198
> URL: https://issues.apache.org/jira/browse/MESOS-2198
> Project: Mesos
> Issue Type: Bug
> Components: framework
> Reporter: Robert Lacroix
>
> Currently Scheduler#statusUpdate can be called multiple times for the same
> status update, for example when the slave retransmits a status update because
> it's not acknowledged in time. Especially for terminal status updates, this
> can lead to unexpected scheduler behavior when task IDs are being reused.
> Consider this scenario:
> * Scheduler schedules task
> * Task fails, slave sends TASK_FAILED
> * Scheduler is busy and libmesos doesn't acknowledge update in time
> * Slave retransmits TASK_FAILED
> * Scheduler eventually receives first TASK_FAILED and reschedules task
> * Second TASK_FAILED triggers statusUpdate again and the scheduler can't
> determine if the TASK_FAILED belongs to the first or second run of the task.
> It would be a lot better if libmesos deduplicated status updates and only
> called Scheduler#statusUpdate once per status update it received. Retries with
> the same UUID shouldn't cause Scheduler#statusUpdate to be executed again.