[ https://issues.apache.org/jira/browse/AURORA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Farner resolved AURORA-1953.
---------------------------------
Resolution: Fixed
> Scheduler livelock during startup
> ---------------------------------
>
> Key: AURORA-1953
> URL: https://issues.apache.org/jira/browse/AURORA-1953
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.18.0
> Reporter: Bill Farner
> Assignee: Jordan Ly
> Priority: Blocker
>
> The scheduler may experience a "livelock" situation while starting up when
> async event handlers running on a {{ThreadPoolExecutor}} block on other events
> that have not yet executed. If enough of these blocking handlers run
> simultaneously to occupy every worker thread, no further event processing
> occurs and the scheduler stalls.
> More specifically, this section of {{TaskGroups}} is affected:
> {code}
> CompletableFuture<Set<String>> result = batchWorker.execute(storeProvider ->
>     taskScheduler.schedule(storeProvider, taskIds));
> Set<String> scheduled = null;
> try {
>   // Blocks the event-handling thread until the batch queue is processed.
>   scheduled = result.get();
> {code}
> {{batchWorker#execute}} submits to a queue that is not processed until a
> {{SchedulerActive}} event is fired within the scheduler. {{SchedulerActive}}
> is delivered via the same {{AsyncEventBus}} that triggers the above code in
> {{TaskGroups}}. Therefore, the following sequence of events will cause a
> livelock:
> {noformat}
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> TaskStateChange=pending
> DriverRegistered
> {noformat}
> Other events may be interleaved with the above, but the essential sequence
> is N {{TaskStateChange=pending}} events, where N is the value of
> {{-async_worker_threads}}, followed by {{DriverRegistered}}.
> This issue was exacerbated by
> [f2755e1|https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01],
> which has the subtle effect of no longer using
> {{GatingDelayExecutor#closeDuring()}}, which would otherwise enqueue all of
> these events until storage recovery is complete. The resulting on-demand
> execution greatly increases the likelihood of the above event sequence, since
> driver registration begins strictly after storage recovery completes.
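
The event sequence described above can be reproduced in isolation. The following is a minimal sketch, not Aurora code: a fixed-size pool stands in for the async event bus, a {{CountDownLatch}} stands in for the batch queue that is only drained once the scheduler is active, and the names {{StartupStarvationDemo}} and {{stalls}} are hypothetical. It assumes only that each pending handler blocks a worker thread in the way {{result.get()}} does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone illustration; none of these names exist in Aurora.
final class StartupStarvationDemo {

  // Returns true if the pending handlers failed to finish within timeoutMs.
  static boolean stalls(int poolSize, int pendingEvents, long timeoutMs) {
    ExecutorService eventBus = Executors.newFixedThreadPool(poolSize);
    // Stand-in for the batch queue that only drains once the scheduler is active.
    CountDownLatch schedulerActive = new CountDownLatch(1);
    CountDownLatch handlersDone = new CountDownLatch(pendingEvents);
    try {
      // Each TaskStateChange=pending handler blocks a worker, like result.get().
      for (int i = 0; i < pendingEvents; i++) {
        eventBus.execute(() -> {
          try {
            schedulerActive.await();
            handlersDone.countDown();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // DriverRegistered -> SchedulerActive is submitted last, on the same pool.
      eventBus.execute(schedulerActive::countDown);

      return !handlersDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      // Interrupt any workers that are still blocked so the JVM can exit.
      eventBus.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // As many blocking handlers as worker threads: the release task never runs.
    System.out.println("pool=4 pending=4 stalled=" + stalls(4, 4, 500));
    // One spare worker thread: the release task runs and everything completes.
    System.out.println("pool=5 pending=4 stalled=" + stalls(5, 4, 500));
  }
}
```

With as many blocking handlers as worker threads, the releasing task stays queued behind them and the first call reports a stall. With one spare thread the second call completes. Deferring the pending handlers until after the releasing event has been processed avoids the stall without enlarging the pool.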