Re: Review Request 65281: Support PARTITIONED state in SLA calculations

David McLaughlin Wed, 24 Jan 2018 22:06:42 -0800


> On Jan. 25, 2018, 2:59 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
> > Lines 333 (patched)
> > <https://reviews.apache.org/r/65281/diff/2/?file=1946477#file1946477line333>
> >
> >     We should only consider `UP` if the previous state is also `UP` for 
> > `PARTITIONED` state. For instance, `KILLING` -> `PARTITIONED` should be 
> > counted as `REMOVED`.
> 
> David McLaughlin wrote:
>     You cannot move from KILLING to PARTITIONED.
> 
> Santhosh Kumar Shanmugham wrote:
>     I was able to trigger a transition from KILLING to PARTITIONED.
>     
>     - Creating job
>     ```
>     vagrant@aurora:~$ aurora job create 
> devcluster/vagrant/test/partition_aware_disabled 
> aurora/src/test/sh/org/apache/aurora/e2e/partition_aware.aurora
>      INFO] Creating job partition_aware_disabled
>      INFO] Checking status of devcluster/vagrant/test/partition_aware_disabled
>     Job create succeeded: job 
> url=http://aurora.local:8081/scheduler/vagrant/test/partition_aware_disabled
>     ```
>     
>     - Checking Status
>     ```
>     vagrant@aurora:~$ aurora job status devcluster
>      INFO] Retrieving jobs for role None
>      INFO] Checking status of devcluster/vagrant/test/partition_aware_disabled
>     Active tasks (1):
>       Task role: vagrant, env: test, name: partition_aware_disabled, 
> instance: 0, status: RUNNING on 192.168.33.7
>         CPU: 0.2 core(s), RAM: 1 MB, Disk: 8 MB
>         events:
>          2018-01-25 04:07:08 PENDING: None
>          2018-01-25 04:07:08 ASSIGNED: None
>          2018-01-25 04:07:09 STARTING: None
>          2018-01-25 04:07:10 RUNNING: No health-check defined, task is 
> assumed healthy.
>     ```
>     
>     - Create Partition
>     ```
>     vagrant@aurora:~$ sudo stop mesos-slave
>     mesos-slave stop/waiting
>     ```
>     
>     - Kill task
>     ```
>     vagrant@aurora:~$ aurora job killall 
> devcluster/vagrant/test/partition_aware_disabled
>      INFO] Killing tasks for job: 
> devcluster/vagrant/test/partition_aware_disabled
>      INFO] Instances to be killed: [0]
>     Instances [0] were not killed in time
>     Exceeded maximum number of errors while killing instances
>     ```
>     
>     - Checking Status
>     ```
>     vagrant@aurora:~$ aurora job status devcluster
>      INFO] Retrieving jobs for role None
>      INFO] Checking status of devcluster/vagrant/test/partition_aware_disabled
>     Active tasks (1):
>       Task role: vagrant, env: test, name: partition_aware_disabled, 
> instance: 0, status: KILLING on 192.168.33.7
>         CPU: 0.2 core(s), RAM: 1 MB, Disk: 8 MB
>         events:
>          2018-01-25 04:07:08 PENDING: None
>          2018-01-25 04:07:08 ASSIGNED: None
>          2018-01-25 04:07:09 STARTING: None
>          2018-01-25 04:07:10 RUNNING: No health-check defined, task is 
> assumed healthy.
>          2018-01-25 04:10:10 KILLING: Killed by aurora
>     ```
>     
>     Scheduler Logs:
>     ```
>     I0125 04:11:18.817 [Thread-66, 
> MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for 
> task 
> vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4 
> in state TASK_UNREACHABLE from SOURCE_MASTER with REASON_AGENT_REMOVED: Agent 
> 192.168.33.7 is unreachable: health check timed out
>     W0125 04:11:18.818 [AsyncProcessor-0, Stats] Re-using already registered 
> variable for key task_delivery_delay_SOURCE_MASTER_timeouts_per_sec
>     W0125 04:11:18.818 [AsyncProcessor-0, Stats] Re-using already registered 
> variable for key task_delivery_delay_SOURCE_MASTER_requests_per_sec
>     I0125 04:11:18.819 [TaskStatusHandlerImpl, StateMachine] 
> vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4 
> state machine transition KILLING -> PARTITIONED
>     I0125 04:11:18.820 [TaskStatusHandlerImpl, StateMachine] 
> vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4 
> state machine transition PARTITIONED -> LOST
>     I0125 04:11:18.820 [TaskStatusHandlerImpl, StateManagerImpl] Task being 
> rescheduled: 
> vagrant-test-partition_aware_disabled-0-3cb6e9ae-6643-460e-a08b-0bdba4bdd8d4
>     I0125 04:11:18.821 [TaskStatusHandlerImpl, StateMachine] 
> vagrant-test-partition_aware_disabled-0-27c9bed4-708d-4d71-a0c1-33584b81c654 
> state machine transition INIT -> PENDING
>     ```
>     
>     As pointed out in https://issues.apache.org/jira/browse/AURORA-1966, this 
> causes the Scheduler to indefinitely keep killing the partitioned task.


Right, that bug aside, KILLING -> PARTITIONED triggers an immediate transition 
to LOST. The intermediate state of PARTITIONED is basically a no-op.
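
To make the two positions concrete, here is a minimal sketch of the rule 
suggested in the review comment: PARTITIONED inherits the SLA bucket of the 
state that preceded it. The class name, the `baseState` mapping and the 
`classify` helper are hypothetical and are not taken from SlaAlgorithm.java; 
only the UP and REMOVED bucket names and the task states come from this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the logic in the patch under review.
final class PartitionSlaSketch {
  // UP and REMOVED are named in the review; DOWN is assumed for not-yet-running states.
  enum SlaState { UP, DOWN, REMOVED }

  // Assumed mapping for every state other than PARTITIONED.
  static SlaState baseState(String status) {
    switch (status) {
      case "RUNNING":
        return SlaState.UP;
      case "KILLING":
      case "LOST":
        return SlaState.REMOVED;
      default:
        return SlaState.DOWN; // PENDING, ASSIGNED, STARTING
    }
  }

  // Classifies each event in order; PARTITIONED keeps the previous bucket.
  static List<SlaState> classify(List<String> statuses) {
    List<SlaState> result = new ArrayList<>();
    SlaState previous = SlaState.DOWN; // assumed bucket before the first event
    for (String status : statuses) {
      SlaState current =
          status.equals("PARTITIONED") ? previous : baseState(status);
      result.add(current);
      previous = current;
    }
    return result;
  }

  public static void main(String[] args) {
    // The sequence from the scheduler log above.
    System.out.println(
        classify(List.of("RUNNING", "KILLING", "PARTITIONED", "LOST")));
  }
}
```

Under these assumptions RUNNING -> PARTITIONED stays UP, while the logged 
KILLING -> PARTITIONED -> LOST sequence stays REMOVED throughout, so the 
short-lived PARTITIONED event does not change the result either way.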


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65281/#review196198
-----------------------------------------------------------


On Jan. 25, 2018, 2:04 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65281/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2018, 2:04 a.m.)
> 
> 
> Review request for Aurora and Jordan Ly.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Support PARTITIONED state in SLA calculations. Also added a test to protect 
> against this regressing in the future.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java 
> 5d8d5bd8f705770979f284d26d2e932aabe707e5 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java 
> 2e719ac6b7aea86faa22deff2cc6b5f73135761c 
> 
> 
> Diff: https://reviews.apache.org/r/65281/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
