[
https://issues.apache.org/jira/browse/AURORA-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007857#comment-14007857
]
Nathan Howell commented on AURORA-470:
--------------------------------------
It's from an older build, but I didn't see any obviously related changes or
tickets. I turned down the flapping interval to 10 seconds and started up a
service that exits after about 10 seconds.
This is on 7db986e53c74e87ec368e395af55300d1711d261 from late March. I couldn't
get a trivial example to repro on rc0, but I haven't tried one with master
failover.
{code}
I0523 20:51:30.002 THREAD18 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
I0523 20:51:30.002 THREAD18 org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading scheduler!
...
0523 20:53:17.968 THREAD165 org.apache.aurora.scheduler.MesosSchedulerImpl.statusUpdate: Received status update for task 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19 in state TASK_FINISHED with core message Task finished.
I0523 20:53:17.981 THREAD165 com.twitter.common.util.StateMachine$Builder$1.execute: 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19 state machine transition RUNNING -> FINISHED
I0523 20:53:17.981 THREAD165 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command RESCHEDULE for 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19
I0523 20:53:17.981 THREAD165 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19
I0523 20:53:17.982 THREAD165 org.apache.aurora.scheduler.state.StateManagerImpl$7.apply: Task being rescheduled: 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19
I0523 20:53:17.982 THREAD165 org.apache.aurora.scheduler.async.RescheduleCalculator$RescheduleCalculatorImpl.getFlappingPenaltyMs: Ancestor of 1400878323661-xxx-0-f11c6fbf-7fe5-4c89-8005-534909443e19 flapped: 1400878228688-xxx-0-01d4c232-981a-455f-b6d3-43559f1af22a
I0523 20:53:17.982 THREAD165 com.twitter.common.util.StateMachine$Builder$1.execute: 1400878397982-xxx-0-58777fe5-9eef-4a46-a123-8f240169ea86 state machine transition INIT -> THROTTLED
I0523 20:53:17.983 THREAD165 org.apache.aurora.scheduler.state.TaskStateMachine.addFollowup: Adding work command SAVE_STATE for 1400878397982-xxx-0-58777fe5-9eef-4a46-a123-8f240169ea86
{code}
!http://i.imgur.com/2FWEPdH.png!
> Tasks get stuck in THROTTLED state on restart or leader change
> --------------------------------------------------------------
>
> Key: AURORA-470
> URL: https://issues.apache.org/jira/browse/AURORA-470
> Project: Aurora
> Issue Type: Story
> Components: Scheduler
> Affects Versions: 0.5.0
> Reporter: Nathan Howell
>
> We're seeing cases where tasks get stuck in the THROTTLED state indefinitely.
> From what I can tell from the logs, this happens if a task is throttled when
> Aurora is shut down or a new leader is elected.
> It looks like the timer that changes the state from THROTTLED to PENDING is
> only set up on a transition to the THROTTLED state, so it seems like there is
> no way to get these tasks running again except to restart them manually.
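
One way to picture the behavior the description above is asking for: on startup
or leader election, the scheduler could scan stored tasks and re-arm a timer for
every task still in THROTTLED. The sketch below is only an illustration under
stated assumptions, not Aurora's actual code. The class name, the Task record,
and the throttledAtMs and penaltyMs fields are all hypothetical, and the sketch
only computes the delay each re-armed timer would need instead of scheduling
anything.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Aurora codebase. */
public class ThrottleRecoverySketch {
    public enum State { PENDING, THROTTLED, RUNNING }

    /** Minimal stand-in for a task record read back from storage. */
    public static final class Task {
        public final String id;
        public final State state;
        public final long throttledAtMs; // assumed: time the task entered THROTTLED
        public final long penaltyMs;     // assumed: flapping penalty applied to it

        public Task(String id, State state, long throttledAtMs, long penaltyMs) {
            this.id = id;
            this.state = state;
            this.throttledAtMs = throttledAtMs;
            this.penaltyMs = penaltyMs;
        }
    }

    /**
     * For every stored task in THROTTLED, returns the delay still owed before
     * it should move to PENDING. A penalty that already expired while no
     * leader was running yields a delay of zero, so the task is released
     * immediately rather than left stuck.
     */
    public static Map<String, Long> remainingDelays(List<Task> stored, long nowMs) {
        Map<String, Long> delays = new LinkedHashMap<>();
        for (Task task : stored) {
            if (task.state == State.THROTTLED) {
                long readyAtMs = task.throttledAtMs + task.penaltyMs;
                delays.put(task.id, Math.max(0L, readyAtMs - nowMs));
            }
        }
        return delays;
    }

    public static void main(String[] args) {
        List<Task> stored = Arrays.asList(
            new Task("a", State.THROTTLED, 1000L, 10000L), // penalty still running
            new Task("b", State.THROTTLED, 0L, 2000L),     // penalty already expired
            new Task("c", State.RUNNING, 0L, 0L));         // not throttled, ignored
        System.out.println(remainingDelays(stored, 5000L)); // prints {a=6000, b=0}
    }
}
```

A recovery hook that ran once after storage is loaded could feed each of these
delays into whatever timer the normal INIT -> THROTTLED transition already
uses, so tasks throttled before a restart or failover would be picked up the
same way as newly throttled ones.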