[ 
https://issues.apache.org/jira/browse/AURORA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221710#comment-16221710
 ] 

Bill Farner commented on AURORA-1953:
-------------------------------------

https://reviews.apache.org/r/63316/ is the current candidate to address this 
issue.

> Scheduler livelock during startup
> ---------------------------------
>
>                 Key: AURORA-1953
>                 URL: https://issues.apache.org/jira/browse/AURORA-1953
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.18.0
>            Reporter: Bill Farner
>            Priority: Blocker
>
> The scheduler may experience a "livelock" situation while starting up due to 
> async events on a {{ThreadPoolExecutor}} that block until other 
> not-yet-executed events are processed.  If enough of these blocking events 
> occur simultaneously, every worker thread is occupied, no further event 
> processing occurs, and the scheduler stalls.
> More specifically, this section of {{TaskGroups}} is afflicted:
> {code}
> CompletableFuture<Set<String>> result = batchWorker.execute(storeProvider ->
>     taskScheduler.schedule(storeProvider, taskIds));
> Set<String> scheduled = null;
> try {
>   scheduled = result.get();
> {code}
> {{batchWorker#execute}} submits to a queue that is not processed until a 
> {{SchedulerActive}} event is fired within the scheduler.  {{SchedulerActive}} 
> is sent via an {{AsyncEventBus}} which happens to also trigger the above code 
> from {{TaskGroups}}.  Therefore, the following sequence of events will cause 
> a livelock:
> {noformat}
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> DriverRegistered
> {noformat}
> Other events may be interleaved with the above, but the important 
> sequence is N {{TaskStateChange=pending}} events, where 
> N={{-async_worker_threads}}, followed by {{DriverRegistered}}.
> This issue was exacerbated by 
> [f2755e1|https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01],
>  which has the subtle effect of not using 
> {{GatingDelayExecutor#closeDuring()}}, which would otherwise enqueue all 
> these events until storage recovery is complete.  The on-demand execution 
> greatly increases the likelihood of the above event sequence, since driver 
> registration begins strictly after storage recovery completes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
