> On Jan. 25, 2018, 2:59 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
> > Lines 333 (patched)
> > <https://reviews.apache.org/r/65281/diff/2/?file=1946477#file1946477line333>
> >
> > We should only consider `UP` if the previous state is also `UP` for
> > `PARTITIONED` state. For instance, `KILLING` -> `PARTITIONED` should be
> > counted as `REMOVED`.
>
> David McLaughlin wrote:
> You cannot move from KILLING to PARTITIONED.
>
> Santhosh Kumar Shanmugham wrote:
> I was able to trigger a transition from KILLING to PARTITIONED.
>
> - Creating job
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/partition_aware_disabled aurora/src/test/sh/org/apache/aurora/e2e/partition_aware.aurora
> INFO] Creating job partition_aware_disabled
> INFO] Checking status of devcluster/vagrant/test/partition_aware_disabled
> Job create succeeded: job url=http://aurora.local:8081/scheduler/vagrant/test/partition_aware_disabled
> ```
>
> - Checking Status
> ```
> vagrant@aurora:~$ aurora job status devcluster
> INFO] Retrieving jobs for role None
> INFO] Checking status of devcluster/vagrant/test/partition_aware_disabled
> Active tasks (1):
> Task role: vagrant, env: test, name: partition_aware_disabled, instance: 0, status: RUNNING on 192.168.33.7
> CPU: 0.2 core(s), RAM: 1 MB, Disk: 8 MB
> events:
> 2018-01-25 04:07:08 PENDING: None
> 2018-01-25 04:07:08 ASSIGNED: None
> 2018-01-25 04:07:09 STARTING: None
> 2018-01-25 04:07:10 RUNNING: No health-check defined, task is assumed healthy.
> ```
>
> - Create Partition
> ```
> vagrant@aurora:~$ sudo stop mesos-slave
> mesos-slave stop/waiting
> ```
>
> - Kill task
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/partition_aware_disabled
> INFO] Killing tasks for job: devcluster/vagrant/test/partition_aware_disabled
> INFO] Instances to be killed: [0]
> Instances [0] were not killed in time
> Exceeded maximum number of errors while killing instances
> ```
>
> - Checking Status
> ```
> vagrant@aurora:~$ aurora job status devcluster
> INFO] Retrieving jobs for role None
> INFO] Checking status of devcluster/vagrant/test/partition_aware_disabled
> Active tasks (1):
> Task role: vagrant, env: test, name: partition_aware_disabled, instance: 0, status: KILLING on 192.168.33.7
> CPU: 0.2 core(s), RAM: 1 MB, Disk: 8 MB
> events:
> 2018-01-25 04:07:08 PENDING: None
> 2018-01-25 04:07:08 ASSIGNED: None
> 2018-01-25 04:07:09 STARTING: None
> 2018-01-25 04:07:10 RUNNING: No health-check defined, task is assumed healthy.
> 2018-01-25 04:10:10 KILLING: Killed by aurora
> ```
>
> Scheduler Logs:
> ```
> I0125 04:11:18.817 [Thread-66, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4 in state TASK_UNREACHABLE from SOURCE_MASTER with REASON_AGENT_REMOVED: Agent 192.168.33.7 is unreachable: health check timed out
> W0125 04:11:18.818 [AsyncProcessor-0, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_timeouts_per_sec
> W0125 04:11:18.818 [AsyncProcessor-0, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_requests_per_sec
> I0125 04:11:18.819 [TaskStatusHandlerImpl, StateMachine] vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4 state machine transition KILLING -> PARTITIONED
> I0125 04:11:18.820 [TaskStatusHandlerImpl, StateMachine] vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4 state machine transition PARTITIONED -> LOST
> I0125 04:11:18.820 [TaskStatusHandlerImpl, StateManagerImpl] Task being rescheduled: vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4
> I0125 04:11:18.821 [TaskStatusHandlerImpl, StateMachine] vagrant-test-partition_aware_disabled-0-27c9bed4-708d-4d71-a0c1-33584b81c654 state machine transition INIT -> PENDING
> ```
>
> As pointed out in https://issues.apache.org/jira/browse/AURORA-1966, this
> causes the Scheduler to indefinitely keep killing the partitioned task.

Right, that bug aside, KILLING -> PARTITIONED triggers an immediate transition to LOST. The intermediate state of PARTITIONED is basically a noop.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65281/#review196198
-----------------------------------------------------------


On Jan. 25, 2018, 2:04 a.m., David McLaughlin wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65281/
> -----------------------------------------------------------
>
> (Updated Jan. 25, 2018, 2:04 a.m.)
>
>
> Review request for Aurora and Jordan Ly.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Support PARTITIONED state in SLA calculations. Also added a test to protect
> against this test failing in the future.
>
>
> Diffs
> -----
>
> src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java 5d8d5bd8f705770979f284d26d2e932aabe707e5
> src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java 2e719ac6b7aea86faa22deff2cc6b5f73135761c
>
>
> Diff: https://reviews.apache.org/r/65281/diff/2/
>
>
> Testing
> -------
>
> ./gradlew test
>
>
> Thanks,
>
> David McLaughlin
>
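
For readers following the thread, the review comment can be illustrated with a minimal Java sketch: `PARTITIONED` is counted as `UP` only when the previous state was itself up, and as `REMOVED` otherwise (for example after `KILLING`). The class `PartitionSlaSketch`, the `Status` and `SlaState` enums, and the `classify` method are hypothetical stand-ins and are not taken from `SlaAlgorithm.java`. How the other states are mapped is an assumption made only to keep the example complete.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: simplified stand-ins, not Aurora's real classes.
class PartitionSlaSketch {
  // Only the task states that appear in this thread.
  enum Status { PENDING, ASSIGNED, STARTING, RUNNING, KILLING, PARTITIONED, LOST }

  // SLA buckets named in the review comment (DOWN is an assumed third bucket).
  enum SlaState { UP, DOWN, REMOVED }

  // Assumption: RUNNING is the only state that counts as UP before a partition.
  private static final Set<Status> UP_STATES = EnumSet.of(Status.RUNNING);

  // Classifies the current state, looking at the previous one for PARTITIONED.
  static SlaState classify(Status previous, Status current) {
    switch (current) {
      case RUNNING:
        return SlaState.UP;
      case PARTITIONED:
        // The suggestion under review: stay UP only if we were UP before.
        return UP_STATES.contains(previous) ? SlaState.UP : SlaState.REMOVED;
      case KILLING:
      case LOST:
        // Assumption: tasks on their way out no longer count toward the SLA.
        return SlaState.REMOVED;
      default:
        // Assumption: PENDING, ASSIGNED and STARTING count as DOWN.
        return SlaState.DOWN;
    }
  }

  public static void main(String[] args) {
    System.out.println(classify(Status.RUNNING, Status.PARTITIONED));
    System.out.println(classify(Status.KILLING, Status.PARTITIONED));
  }
}
```

Since KILLING -> PARTITIONED moves on to LOST immediately, as the reply above notes, the `REMOVED` branch for `PARTITIONED` would only describe a state that is held for a moment.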
