-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/
-----------------------------------------------------------

(Updated Jan. 25, 2018, 9:33 a.m.)


Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Bugs: AURORA-1966
    https://issues.apache.org/jira/browse/AURORA-1966


Repository: aurora


Description
-------

As reported in https://issues.apache.org/jira/browse/AURORA-1966, Mesos sends 
TASK_UNKNOWN when we try to kill (or reconcile) tasks that it does not know 
about. On master, this leads to an infinite loop. The sequence of events is:

1) We map TASK_UNKNOWN to PARTITIONED.
2) We react to a restarting -> PARTITIONED or terminal -> PARTITIONED 
transition by telling Mesos "that is a bad state transition, that task should 
be dead".
3) Mesos replies that the task is TASK_UNKNOWN.
4) Go to 1.
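
The loop above can be sketched as a small, bounded simulation. This is only an 
illustration: the class, enum and method names are hypothetical and are not 
Aurora's actual code, and the round limit exists only so the example 
terminates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the feedback loop; not Aurora's classes. */
final class UnknownTaskLoopSketch {
  enum MesosStatus { TASK_UNKNOWN }
  enum SchedulerState { PARTITIONED }

  // Step 1: TASK_UNKNOWN is mapped to PARTITIONED.
  static SchedulerState map(MesosStatus status) {
    return SchedulerState.PARTITIONED;
  }

  // Step 3: Mesos does not know the task, so a kill is answered with TASK_UNKNOWN.
  static MesosStatus killUnknownTask() {
    return MesosStatus.TASK_UNKNOWN;
  }

  /** Runs the pre-patch behaviour for at most maxRounds rounds and logs each step. */
  static List<String> simulate(int maxRounds) {
    List<String> log = new ArrayList<>();
    MesosStatus status = MesosStatus.TASK_UNKNOWN;
    for (int round = 0; round < maxRounds; round++) {
      log.add("update -> " + map(status));
      // Step 2: the task should already be dead, so the transition is
      // rejected and a kill is sent.
      log.add("kill");
      status = killUnknownTask();
      // Step 4: the reply is again TASK_UNKNOWN, so the next round is identical.
    }
    return log;
  }

  public static void main(String[] args) {
    simulate(3).forEach(System.out::println);
  }
}
```

Every round ends in the same state it started in, so nothing short of the 
artificial round limit stops the exchange.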

AURORA-1966 describes just one case of this happening, but there are many other 
legitimate paths that lead to the same loop. 

This patch cleans up the logic. The two main changes:

1) Do not allow ASSIGNED -> PARTITIONED. This is not really related to this 
bug, but I found this logic error during debugging. ASSIGNED is a transient 
state and is subject to the transient task timeout in the Scheduler, so we 
should not attempt to move to PARTITIONED during that window. 
2) Do not try to kill tasks we think are terminal when Mesos tells us they are 
unknown. Originally we did this because "manageTerminalTasks" is also used for 
restarting tasks, but in both cases it never makes sense to respond to "I 
don't know about that task" with a request to kill it.
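
As a rough sketch of the two rules, assuming a reduced set of states: the 
class and method names below are hypothetical, the real logic lives in 
TaskStateMachine.java, and treating RUNNING as the only state eligible for 
PARTITIONED is a simplification of this example rather than something the 
patch states.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical illustration of the two rules; not the actual TaskStateMachine. */
final class PartitionRulesSketch {
  // Reduced, illustrative subset of scheduler-side task states.
  enum State { ASSIGNED, RUNNING, RESTARTING, FINISHED, FAILED }
  // Reduced subset of statuses reported by Mesos.
  enum Reported { TASK_UNKNOWN, TASK_RUNNING }

  static final Set<State> TERMINAL = EnumSet.of(State.FINISHED, State.FAILED);

  // Change 1: ASSIGNED is transient and covered by the transient task
  // timeout, so it must not move to PARTITIONED.
  // Assumption: only RUNNING is modelled as eligible here.
  static boolean mayBecomePartitioned(State current) {
    return current == State.RUNNING;
  }

  // Change 2: never answer "I don't know about that task" with a kill.
  // Assumption: other unexpected updates for tasks that should be dead
  // still result in a kill.
  static boolean shouldKill(State current, Reported reported) {
    boolean shouldBeDead = TERMINAL.contains(current) || current == State.RESTARTING;
    return shouldBeDead && reported != Reported.TASK_UNKNOWN;
  }

  public static void main(String[] args) {
    System.out.println(mayBecomePartitioned(State.ASSIGNED));
    System.out.println(shouldKill(State.FINISHED, Reported.TASK_UNKNOWN));
    System.out.println(shouldKill(State.FINISHED, Reported.TASK_RUNNING));
  }
}
```

With the second rule in place, a TASK_UNKNOWN reply no longer triggers another 
kill, which is what breaks the cycle.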


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
b8ba5da729fcf5965b577c23e3062e5607bd07e7 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 
3d98fe651ad2b89a03044e8a06953a0cea876321 


Diff: https://reviews.apache.org/r/65339/diff/2/

Changes: https://reviews.apache.org/r/65339/diff/1-2/


Testing
-------

./gradlew test

Verified this fixes the issue reported in AURORA-1966 by forcing 
LaunchException in OfferManagerImpl in my vagrant image and viewing logs.


Thanks,

David McLaughlin
