[jira] [Commented] (AURORA-470) Tasks get stuck in THROTTLED state on restart or leader change

Nathan Howell (JIRA) Fri, 23 May 2014 16:19:06 -0700

    [ https://issues.apache.org/jira/browse/AURORA-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007857#comment-14007857 ]


Nathan Howell commented on AURORA-470:
--------------------------------------

It's from an older build, but I didn't see any obviously related changes or
tickets. I turned the flapping interval down to 10 seconds and started a
service that exits after about 10 seconds.

This is on 7db986e53c74e87ec368e395af55300d1711d261 from late March. I couldn't
get a trivial example to reproduce on rc0, but I haven't tried one with master
failover.

{code}
I0523 20:51:30.002 THREAD18 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
I0523 20:51:30.002 THREAD18 org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading scheduler!
...
0523 20:53:17.968 THREAD165 org.apache.aurora.scheduler.MesosSchedulerImpl.statusUpdate: Received status update for task 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19 in state TASK_FINISHED with core message Task finished.
I0523 20:53:17.981 THREAD165 com.twitter.common.util.StateMachine$Builder$1.execute: 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19 state machine transition RUNNING -> FINISHED
I0523 20:53:17.981 THREAD165 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command RESCHEDULE for 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19
I0523 20:53:17.981 THREAD165 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19
I0523 20:53:17.982 THREAD165 org.apache.aurora.scheduler.state.StateManagerImpl$7.apply: Task being rescheduled: 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19
I0523 20:53:17.982 THREAD165 org.apache.aurora.scheduler.async.RescheduleCalculator$RescheduleCalculatorImpl.getFlappingPenaltyMs: Ancestor of 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19 flapped: 1400878228688-xxx-0-01d4c232-981a-455f-b6d3-43559f1af22a
I0523 20:53:17.982 THREAD165 com.twitter.common.util.StateMachine$Builder$1.execute: 1400878397982-xxx-0-58777fe5-9eef-4a46-a123-8f240169ea86 state machine transition INIT -> THROTTLED
I0523 20:53:17.983 THREAD165 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1400878397982-xxx-0-58777fe5-9eef-4a46-a123-8f240169ea86
{code}
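The log shows the reschedule path: after the task finishes, `getFlappingPenaltyMs` reports that its ancestor flapped, and the replacement task moves from INIT to THROTTLED instead of PENDING. The Java sketch below is a hypothetical model of that decision only: `FlapCheck`, the rule that a task "flapped" if it ran for less than the flapping interval, and the fixed penalty are all assumptions, not Aurora's actual `RescheduleCalculator` logic.

```java
import java.time.Duration;

/**
 * Hypothetical model of the reschedule decision seen in the log. The flap
 * rule and the fixed penalty are assumptions, not Aurora's real logic.
 */
final class FlapCheck {
  enum InitialState { PENDING, THROTTLED }

  private final Duration flappingInterval; // e.g. the 10 s used in the repro
  private final Duration penalty;          // assumed fixed throttle delay

  FlapCheck(Duration flappingInterval, Duration penalty) {
    this.flappingInterval = flappingInterval;
    this.penalty = penalty;
  }

  /** Assumed rule: the ancestor flapped if it ran for less than the interval. */
  boolean flapped(Duration ancestorRuntime) {
    return ancestorRuntime.compareTo(flappingInterval) < 0;
  }

  /** Delay before the replacement may become PENDING; 0 means no throttling. */
  long penaltyMs(Duration ancestorRuntime) {
    return flapped(ancestorRuntime) ? penalty.toMillis() : 0L;
  }

  /** State the replacement task enters right after INIT. */
  InitialState initialState(Duration ancestorRuntime) {
    return penaltyMs(ancestorRuntime) > 0 ? InitialState.THROTTLED : InitialState.PENDING;
  }
}
```

Under this assumed rule, a service that exits after about 10 seconds with a 10-second flapping interval sits right at the boundary, so whether a given replacement is throttled depends on small timing differences.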

!http://i.imgur.com/2FWEPdH.png!

> Tasks get stuck in THROTTLED state on restart or leader change
> --------------------------------------------------------------
>
>                 Key: AURORA-470
>                 URL: https://issues.apache.org/jira/browse/AURORA-470
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>    Affects Versions: 0.5.0
>            Reporter: Nathan Howell
>
> We're seeing cases where tasks get stuck in the THROTTLED state indefinitely.
> From what I can tell from the logs, this happens if a task is throttled when
> Aurora is shut down or a new leader is elected.
> It looks like the timer that changes the state from THROTTLED to PENDING is
> only set up on a transition to the THROTTLED state, so there seems to be
> no way to get these tasks running again except to restart them manually.
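The failure mode described above can be modelled in a few lines. This Java sketch is hypothetical: `ThrottleScheduler`, `onElectedLeader` and the map standing in for persistent storage are illustrative names, not Aurora's classes. It shows why a task persisted as THROTTLED stays there after a restart when the promotion timer is armed only on the transition into THROTTLED, plus one possible remedy (re-arming timers for already-throttled tasks when a scheduler becomes leader), which is an assumption rather than the project's actual fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical model: task states survive a restart or leader change (they
 * live in storage), but the in-memory THROTTLED -> PENDING timer does not.
 */
final class ThrottleScheduler {
  enum State { THROTTLED, PENDING }

  private final ConcurrentHashMap<String, State> store; // stands in for storage
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
  private final long penaltyMs;

  ThrottleScheduler(ConcurrentHashMap<String, State> store, long penaltyMs) {
    this.store = store;
    this.penaltyMs = penaltyMs;
  }

  /** Transition into THROTTLED: as reported, the only place the timer is armed. */
  void throttle(String taskId) {
    store.put(taskId, State.THROTTLED);
    armTimer(taskId);
  }

  private void armTimer(String taskId) {
    // Promote only if the task is still THROTTLED when the timer fires.
    timer.schedule(() -> { store.replace(taskId, State.THROTTLED, State.PENDING); },
        penaltyMs, TimeUnit.MILLISECONDS);
  }

  /**
   * Possible fix (assumption): on becoming leader, re-arm timers for tasks
   * already THROTTLED in storage. Returns the number of timers re-armed.
   */
  int onElectedLeader() {
    int rearmed = 0;
    for (Map.Entry<String, State> e : store.entrySet()) {
      if (e.getValue() == State.THROTTLED) {
        armTimer(e.getKey());
        rearmed++;
      }
    }
    return rearmed;
  }

  /** Stops the timer; already-scheduled promotions still run by default. */
  void shutdownAndAwait() {
    timer.shutdown();
    try {
      timer.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Without the call to `onElectedLeader()`, a task loaded from storage in THROTTLED never reaches PENDING; with it, the task is promoted once the penalty elapses. The conditional `replace` keeps a late timer from overwriting a state that changed in the meantime, for example if the task was killed while throttled.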



--
This message was sent by Atlassian JIRA
(v6.2#6252)
