[ https://issues.apache.org/jira/browse/AURORA-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643199#comment-14643199 ]

Maxim Khutornenko commented on AURORA-1404:
-------------------------------------------

The response time for stuck ASSIGNED tasks can be improved via AURORA-1370. I 
think it's generally more robust to kill/reschedule an ASSIGNED task instead of 
retrying a {{launchTasks}} call for something that's already in-flight.
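
A minimal sketch of the timeout-based alternative to retrying the launch: track when each task entered ASSIGNED and mark it LOST if no later status update arrives within a deadline. Every name here ({{AssignedTaskReaper}}, {{assign}}, {{statusUpdate}}, {{reap}}, the timeout value) is hypothetical and is not taken from the Aurora or Mesos code; rescheduling of LOST tasks is left to the scheduler and is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Aurora's actual implementation.
final class AssignedTaskReaper {
  enum Status { ASSIGNED, STARTING, RUNNING, LOST }

  // Last known status per task id.
  private final Map<String, Status> statuses = new LinkedHashMap<>();
  // Time (ms) at which each still-ASSIGNED task entered ASSIGNED.
  private final Map<String, Long> assignedAtMs = new LinkedHashMap<>();
  private final long timeoutMs;

  AssignedTaskReaper(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Record that a task was moved to ASSIGNED at the given time.
  void assign(String taskId, long nowMs) {
    statuses.put(taskId, Status.ASSIGNED);
    assignedAtMs.put(taskId, nowMs);
  }

  // Record a status update; any state other than ASSIGNED stops the clock.
  void statusUpdate(String taskId, Status status) {
    statuses.put(taskId, status);
    if (status != Status.ASSIGNED) {
      assignedAtMs.remove(taskId);
    }
  }

  // Mark every task that has sat in ASSIGNED for at least timeoutMs as LOST
  // and return the ids that were transitioned.
  List<String> reap(long nowMs) {
    List<String> lost = new ArrayList<>();
    for (Map.Entry<String, Long> entry : assignedAtMs.entrySet()) {
      if (nowMs - entry.getValue() >= timeoutMs) {
        lost.add(entry.getKey());
      }
    }
    for (String taskId : lost) {
      statuses.put(taskId, Status.LOST);
      assignedAtMs.remove(taskId);
    }
    return lost;
  }

  Status statusOf(String taskId) {
    return statuses.get(taskId);
  }
}
```

Calling {{reap}} periodically keeps the launch path free of retries: a task whose launch message was dropped during a master failover is simply declared LOST once the deadline passes, while a task that did report STARTING is left alone.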

> Reconcile ASSIGNED tasks that have not transitioned to STARTING
> ---------------------------------------------------------------
>
>                 Key: AURORA-1404
>                 URL: https://issues.apache.org/jira/browse/AURORA-1404
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler
>            Reporter: Joshua Cohen
>
> If the Mesos master fails over after Aurora moves a task to {{ASSIGNED}} 
> but before the slave receives the message, those tasks will never 
> transition and will eventually be timed out by 
> [TaskTimeout|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/async/TaskTimeout.java].
> Instead, it would be better if we had a mechanism similar to 
> [KillRetry|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/async/KillRetry.java]
>  that checks that assigned tasks have transitioned to a running state and, 
> if not, transitions them to {{LOST}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
