Bill Farner created AURORA-1953:
-----------------------------------

             Summary: Scheduler livelock during startup
                 Key: AURORA-1953
                 URL: https://issues.apache.org/jira/browse/AURORA-1953
             Project: Aurora
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 0.18.0
            Reporter: Bill Farner
            Priority: Blocker

The scheduler may experience a "livelock" situation while starting up due to async events on a {{ThreadPoolExecutor}} that require other not-yet-executed events to be processed. If enough of these blocking events occur simultaneously, no further event processing occurs and the scheduler stalls.

More specifically, this section of {{TaskGroups}} is afflicted:

{code}
CompletableFuture<Set<String>> result = batchWorker.execute(storeProvider ->
    taskScheduler.schedule(storeProvider, taskIds));
Set<String> scheduled = null;
try {
  scheduled = result.get();
{code}

{{batchWorker#execute}} submits to a queue that is not processed until a {{SchedulerActive}} event is fired within the scheduler. {{SchedulerActive}} is sent via an {{AsyncEventBus}} which happens to also trigger the above code from {{TaskGroups}}. Therefore, the following sequence of events will cause a livelock:

{noformat}
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
DriverRegistered
{noformat}

Any other events may occur between the above calls, but the important sequence is N {{TaskStateChange=pending}} events, where N={{-async_worker_threads}}, followed by {{DriverRegistered}}.

This issue was exacerbated by [f2755e1|https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01], which has the subtle effect of not using {{GatingDelayExecutor#closeDuring()}}, which would enqueue all these events until storage recovery is complete. The on-demand execution greatly increases the likelihood of the above event sequence, since driver registration begins strictly after storage recovery completes.
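
The stall pattern described above can be sketched outside of Aurora with plain {{java.util.concurrent}} classes. This is a minimal illustration, not the scheduler's actual code: {{StartupStallSketch}}, {{GatedQueue}}, {{stalls}} and the 500 ms timeout are all hypothetical names and values chosen for the example. It assumes a fixed-size pool standing in for the async event executor, a queue that holds work until it is activated, and an activation event that is dispatched on the same pool.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class StartupStallSketch {

  /** Hypothetical stand-in for a gated work queue: work is held until activate() runs. */
  static final class GatedQueue {
    private final Queue<Runnable> held = new ConcurrentLinkedQueue<>();
    private boolean active = false;

    synchronized CompletableFuture<String> execute(String taskId) {
      CompletableFuture<String> result = new CompletableFuture<>();
      Runnable work = () -> result.complete(taskId);
      if (active) {
        work.run();
      } else {
        held.add(work); // Not processed until activate() is called.
      }
      return result;
    }

    synchronized void activate() {
      active = true;
      Runnable work;
      while ((work = held.poll()) != null) {
        work.run();
      }
    }
  }

  /** Returns true if the activation event never ran within the timeout. */
  static boolean stalls(int workerThreads, int pendingEvents) {
    ExecutorService eventPool = Executors.newFixedThreadPool(workerThreads);
    GatedQueue queue = new GatedQueue();
    CountDownLatch activated = new CountDownLatch(1);
    try {
      // Each "pending" event handler enqueues gated work and blocks on its result.
      for (int i = 0; i < pendingEvents; i++) {
        String taskId = "task-" + i;
        eventPool.execute(() -> {
          try {
            queue.execute(taskId).get();
          } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The activation event is dispatched on the same pool, behind the handlers above.
      eventPool.execute(() -> {
        queue.activate();
        activated.countDown();
      });
      return !activated.await(500, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for activation", e);
    } finally {
      eventPool.shutdownNow(); // Interrupts blocked handlers so the JVM can exit.
    }
  }

  public static void main(String[] args) {
    System.out.println("2 threads, 1 pending event: stalled=" + stalls(2, 1));
    System.out.println("2 threads, 2 pending events: stalled=" + stalls(2, 2));
  }
}
```

In this sketch the stall appears once the number of blocking handlers reaches the pool size, which mirrors the N={{-async_worker_threads}} condition in the report. With one fewer handler, a thread remains free to run the activation and the blocked handlers complete.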