-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/#review196230
-----------------------------------------------------------

Ship it!


Master (dbe7137) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Jan. 25, 2018, 10:33 a.m., David McLaughlin wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65339/
> -----------------------------------------------------------
>
> (Updated Jan. 25, 2018, 10:33 a.m.)
>
>
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
>
>
> Bugs: AURORA-1966
>     https://issues.apache.org/jira/browse/AURORA-1966
>
>
> Repository: aurora
>
>
> Description
> -------
>
> As reported in https://issues.apache.org/jira/browse/AURORA-1966, Mesos sends
> a TASK_UNKNOWN when we try to kill (or reconcile) tasks that are unknown. On
> master, this leads to an infinite loop. The sequence of events is:
>
> 1) We map TASK_UNKNOWN to PARTITIONED
> 2) We react to restarting or terminal -> PARTITIONED state by telling Mesos
> "that is a bad state transition, that task should be dead".
> 3) Mesos replies with: that task is TASK_UNKNOWN
> 4) GO TO 1
>
> AURORA-1966 describes just one case of this happening, but there are many
> other legitimate paths to this.
>
> This patch cleans up the logic. The two main changes:
>
> 1) Do not allow ASSIGNED -> PARTITIONED. This is not really related to this
> bug, but I found this logic error during debugging. ASSIGNED is a transient
> state and is subject to the transient task timeout in the Scheduler, so we
> should not attempt to move to PARTITIONED during that window.
> 2) Do not try to kill tasks we think are terminal when Mesos tells us they
> are unknown. Originally we did this because "manageTerminalTasks" is also
> used for restarting tasks - but in both cases it never makes sense to respond
> to "I don't know about that task" with a request to kill it.
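
To make the loop and the two changes described above concrete, here is a minimal sketch in Java. It is not the code in TaskStateMachine.java: the class PartitionSketch, its enums, and the methods beforePatch, afterPatch and killRounds are hypothetical names invented for illustration, the state set is cut down to four states, and "NOTHING" is an assumption standing in for whatever the scheduler actually does when it neither kills nor moves the task to PARTITIONED.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the behaviour described in the review.
// This is NOT Aurora's TaskStateMachine; every name here is illustrative.
final class PartitionSketch {
  // Small subset of scheduler-side task states, enough for the example.
  enum State { ASSIGNED, RUNNING, RESTARTING, KILLED }

  // Scheduler reaction to a TASK_UNKNOWN status update from Mesos.
  enum Action { MOVE_TO_PARTITIONED, KILL, NOTHING }

  private static final Set<State> RESTARTING_OR_TERMINAL =
      EnumSet.of(State.RESTARTING, State.KILLED);

  // Behaviour on master as the description tells it: restarting or terminal
  // tasks answer TASK_UNKNOWN with a kill; everything else is partitioned.
  static Action beforePatch(State current) {
    return RESTARTING_OR_TERMINAL.contains(current)
        ? Action.KILL
        : Action.MOVE_TO_PARTITIONED;
  }

  // Behaviour with the two changes applied (assumed shape, see lead-in).
  static Action afterPatch(State current) {
    if (current == State.ASSIGNED) {
      return Action.NOTHING; // change 1: the transient task timeout covers ASSIGNED
    }
    if (RESTARTING_OR_TERMINAL.contains(current)) {
      return Action.NOTHING; // change 2: never answer "unknown" with a kill
    }
    return Action.MOVE_TO_PARTITIONED;
  }

  // Counts kill requests sent, assuming Mesos answers each kill of an
  // unknown task with another TASK_UNKNOWN. Capped so the loop terminates.
  static int killRounds(boolean patched, State current, int maxRounds) {
    int rounds = 0;
    while (rounds < maxRounds) {
      Action action = patched ? afterPatch(current) : beforePatch(current);
      if (action != Action.KILL) {
        break;
      }
      rounds++; // kill sent; Mesos replies TASK_UNKNOWN and we go around again
    }
    return rounds;
  }

  public static void main(String[] args) {
    System.out.println(killRounds(false, State.KILLED, 5)); // 5: only the cap stops it
    System.out.println(killRounds(true, State.KILLED, 5));  // 0: no kill is sent
    System.out.println(afterPatch(State.ASSIGNED));         // NOTHING
    System.out.println(afterPatch(State.RUNNING));          // MOVE_TO_PARTITIONED
  }
}
```

In this model the unpatched path only stops because of the artificial maxRounds cap, which mirrors steps 1 to 4 of the description; with both changes applied, a TASK_UNKNOWN for a restarting or terminal task produces no kill request at all.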
>
>
> Diffs
> -----
>
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java b8ba5da729fcf5965b577c23e3062e5607bd07e7
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 3d98fe651ad2b89a03044e8a06953a0cea876321
>
>
> Diff: https://reviews.apache.org/r/65339/diff/2/
>
>
> Testing
> -------
>
> ./gradlew test
>
> Verified this fixes the issue reported in AURORA-1966 by forcing
> LaunchException in OfferManagerImpl in my vagrant image and viewing logs.
>
>
> Thanks,
>
> David McLaughlin
>
>