-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/
-----------------------------------------------------------
(Updated Jan. 25, 2018, 9:33 a.m.)
Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
Bugs: AURORA-1966
https://issues.apache.org/jira/browse/AURORA-1966
Repository: aurora
Description
-------
As reported in https://issues.apache.org/jira/browse/AURORA-1966, Mesos sends a
TASK_UNKNOWN when we try to kill (or reconcile) tasks it does not know about.
On master, this leads to an infinite loop. The sequence of events is:
1) We map TASK_UNKNOWN to PARTITIONED
2) We react to a transition from a restarting or terminal state to PARTITIONED
by telling Mesos "that is a bad state transition, that task should be dead".
3) Mesos replies that the task is TASK_UNKNOWN
4) GO TO 1
AURORA-1966 describes just one case of this happening, but many other
legitimate paths lead to the same loop.
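The loop can be reproduced with a toy model of the four steps. This is only a
sketch, not Aurora or Mesos code: StatusLoopSketch, its enums, and the
maxRounds cap are hypothetical names introduced for illustration, and the cap
exists only so the example terminates.

```java
public class StatusLoopSketch {
    // Hypothetical, simplified stand-ins for Mesos and scheduler states.
    enum MesosStatus { TASK_KILLED, TASK_UNKNOWN }
    enum SchedulerState { KILLED, PARTITIONED }

    // Step 1: TASK_UNKNOWN is mapped to PARTITIONED.
    static SchedulerState map(MesosStatus status) {
        return status == MesosStatus.TASK_UNKNOWN
            ? SchedulerState.PARTITIONED
            : SchedulerState.KILLED;
    }

    // Step 2 (behavior before the patch): a terminal task that is reported
    // as PARTITIONED is answered with a kill request.
    static boolean sendsKill(SchedulerState current, SchedulerState proposed) {
        return current == SchedulerState.KILLED
            && proposed == SchedulerState.PARTITIONED;
    }

    // Step 3: Mesos has no record of the task, so a kill yields TASK_UNKNOWN.
    static MesosStatus mesosReplyToKill() {
        return MesosStatus.TASK_UNKNOWN;
    }

    // Step 4: count how many kill/reply rounds happen before the cap is hit.
    static int countRounds(int maxRounds) {
        SchedulerState current = SchedulerState.KILLED;
        MesosStatus update = MesosStatus.TASK_UNKNOWN;
        int rounds = 0;
        while (rounds < maxRounds && sendsKill(current, map(update))) {
            update = mesosReplyToKill();
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(countRounds(1000));
    }
}
```

In this model the exchange never settles by itself: countRounds always runs
until the cap, because every kill request produces another TASK_UNKNOWN.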
This patch cleans up the logic. The two main changes:
1) Do not allow ASSIGNED -> PARTITIONED. This is not really related to this
bug, but I found this logic error during debugging. ASSIGNED is a transient
state and is subject to the transient task timeout in the Scheduler, so we
should not attempt to move to PARTITIONED during that window.
2) Do not try to kill tasks we think are terminal when Mesos tells us they are
unknown. Originally we did this because "manageTerminalTasks" is also used for
restarting tasks - but in both cases it never makes sense to respond to "I
don't know about that task" with a request to kill it.
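The two rules can be sketched as a small decision function. This is a
simplified illustration under stated assumptions, not the code in
TaskStateMachine.java: PartitionRulesSketch, the State subset, and the Outcome
enum are hypothetical, and letting every other state move to PARTITIONED is an
assumption made to keep the example short.

```java
import java.util.EnumSet;
import java.util.Set;

public class PartitionRulesSketch {
    // Simplified subset of task states, for illustration only.
    enum State { ASSIGNED, RUNNING, RESTARTING, FINISHED, FAILED, KILLED, LOST }

    // Hypothetical outcomes of receiving a PARTITIONED update.
    enum Outcome { MOVE_TO_PARTITIONED, NO_ACTION }

    static final Set<State> TERMINAL =
        EnumSet.of(State.FINISHED, State.FAILED, State.KILLED, State.LOST);

    static Outcome onPartitioned(State current) {
        if (current == State.ASSIGNED) {
            // Change 1: ASSIGNED is transient and already covered by the
            // transient task timeout, so it does not move to PARTITIONED.
            return Outcome.NO_ACTION;
        }
        if (current == State.RESTARTING || TERMINAL.contains(current)) {
            // Change 2: never answer "unknown task" with a kill request.
            return Outcome.NO_ACTION;
        }
        // Assumption: any other state in this sketch may become PARTITIONED.
        return Outcome.MOVE_TO_PARTITIONED;
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " -> " + onPartitioned(s));
        }
    }
}
```

The point of the sketch is that no branch issues a kill in response to a
PARTITIONED update, which is what breaks the loop.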
Diffs (updated)
-----
src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
b8ba5da729fcf5965b577c23e3062e5607bd07e7
src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java
3d98fe651ad2b89a03044e8a06953a0cea876321
Diff: https://reviews.apache.org/r/65339/diff/2/
Changes: https://reviews.apache.org/r/65339/diff/1-2/
Testing
-------
./gradlew test
Verified that this fixes the issue reported in AURORA-1966 by forcing a
LaunchException in OfferManagerImpl in my Vagrant image and viewing the logs.
Thanks,
David McLaughlin