-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65339/#review196230
-----------------------------------------------------------


Ship it!




Master (dbe7137) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Jan. 25, 2018, 10:33 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65339/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 10:33 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1966
>     https://issues.apache.org/jira/browse/AURORA-1966
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> As reported in https://issues.apache.org/jira/browse/AURORA-1966, Mesos sends 
> a TASK_UNKNOWN when we try to kill (or reconcile) tasks that it does not know 
> about. On master, this leads to an infinite loop. The sequence of events is:
> 
> 1) We map TASK_UNKNOWN to PARTITIONED
> 2) We react to a restarting -> PARTITIONED or terminal -> PARTITIONED 
> transition by telling Mesos "that is a bad state transition, that task 
> should be dead".
> 3) Mesos replies with: that task is TASK_UNKNOWN
> 4) GO TO 1
> 
> AURORA-1966 describes just one case of this happening, but many other 
> legitimate paths lead to the same loop.
> 
> This patch cleans up the logic. The two main changes:
> 
> 1) Do not allow ASSIGNED -> PARTITIONED. This is not really related to this 
> bug, but I found this logic error during debugging. ASSIGNED is a transient 
> state and is subject to the transient task timeout in the Scheduler, so we 
> should not attempt to move to PARTITIONED during that window. 
> 2) Do not try to kill tasks we think are terminal when Mesos tells us they 
> are unknown. Originally we did this because "manageTerminalTasks" is also 
> used for restarting tasks, but in both cases it never makes sense to respond 
> to "I don't know about that task" with a request to kill it.
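
The loop and the two changes above can be illustrated with a minimal sketch. This is not the code in TaskStateMachine.java: the class, the reduced State and Action enums, and the `patched` flag are all hypothetical names invented for illustration, and the sketch assumes that Mesos answers every kill request for an unknown task with another TASK_UNKNOWN.

```java
import java.util.EnumSet;
import java.util.Set;

public class PartitionedTransitionSketch {
  // Hypothetical, reduced set of task states; not Aurora's real status enum.
  enum State { ASSIGNED, RUNNING, RESTARTING, FINISHED }

  // Hypothetical reactions of the scheduler to a TASK_UNKNOWN update.
  enum Action { IGNORE, KILL, MARK_PARTITIONED }

  // States in which the scheduler already considers the task dead or dying.
  static final Set<State> DEAD_OR_RESTARTING =
      EnumSet.of(State.RESTARTING, State.FINISHED);

  // Decide how to react when Mesos reports TASK_UNKNOWN (mapped to PARTITIONED).
  // patched == false models the behavior described for master,
  // patched == true models the two changes listed in the description.
  static Action onTaskUnknown(State current, boolean patched) {
    if (DEAD_OR_RESTARTING.contains(current)) {
      // Change 2: do not answer "I don't know that task" with a kill request.
      return patched ? Action.IGNORE : Action.KILL;
    }
    if (current == State.ASSIGNED) {
      // Change 1: ASSIGNED is transient, so do not move it to PARTITIONED.
      return patched ? Action.IGNORE : Action.MARK_PARTITIONED;
    }
    return Action.MARK_PARTITIONED;
  }

  // Count kill requests sent, assuming every kill of an unknown task is
  // answered with TASK_UNKNOWN again. maxRounds caps the otherwise endless loop.
  static int countKills(State current, boolean patched, int maxRounds) {
    int kills = 0;
    for (int round = 0; round < maxRounds; round++) {
      if (onTaskUnknown(current, patched) != Action.KILL) {
        break;
      }
      kills++;
    }
    return kills;
  }

  public static void main(String[] args) {
    System.out.println("kills on master:  " + countKills(State.FINISHED, false, 5));
    System.out.println("kills with patch: " + countKills(State.FINISHED, true, 5));
  }
}
```

Without the cap, the unpatched branch would keep sending kill requests for as long as Mesos keeps replying TASK_UNKNOWN, which is the loop in steps 1 to 4.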
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java 
> b8ba5da729fcf5965b577c23e3062e5607bd07e7 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 
> 3d98fe651ad2b89a03044e8a06953a0cea876321 
> 
> 
> Diff: https://reviews.apache.org/r/65339/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> Verified this fixes the issue reported in AURORA-1966 by forcing a 
> LaunchException in OfferManagerImpl in my Vagrant image and checking the logs.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
