Bill Farner created AURORA-1953:
-----------------------------------
Summary: Scheduler livelock during startup
Key: AURORA-1953
URL: https://issues.apache.org/jira/browse/AURORA-1953
Project: Aurora
Issue Type: Bug
Components: Scheduler
Affects Versions: 0.18.0
Reporter: Bill Farner
Priority: Blocker
The scheduler may experience a "livelock" situation while starting up due to
async event handlers on a {{ThreadPoolExecutor}} that block until other,
not-yet-executed events are processed. If enough of these blocking handlers
run simultaneously to occupy every worker thread, no further event processing
occurs and the scheduler stalls.
More specifically, this section of {{TaskGroups}} is afflicted:
{code}
CompletableFuture<Set<String>> result = batchWorker.execute(storeProvider ->
    taskScheduler.schedule(storeProvider, taskIds));
Set<String> scheduled = null;
try {
  scheduled = result.get();
{code}
{{batchWorker#execute}} submits to a queue that is not processed until a
{{SchedulerActive}} event is fired within the scheduler. {{SchedulerActive}}
is sent via an {{AsyncEventBus}}, which also happens to trigger the above code
from {{TaskGroups}}, so every handler blocked in {{result.get()}} holds a
worker thread that the {{AsyncEventBus}} needs. Therefore, the following
sequence of events will cause a livelock:
{noformat}
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
DriverRegistered
{noformat}
Other events may occur in between, but the important sequence is
N {{TaskStateChange=pending}} events, where N={{-async_worker_threads}},
followed by {{DriverRegistered}}.
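The stall can be reproduced outside Aurora with a small sketch. This is not
Aurora code: the class {{StartupStallSketch}}, the method {{stalls}}, and the
timeout values are hypothetical names chosen for illustration, and a plain
fixed thread pool stands in for the executor behind the {{AsyncEventBus}}. It
submits a number of handlers that block on a future, followed by the single
task that would complete that future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the startup sequence; none of these names exist in Aurora.
class StartupStallSketch {

  // Returns true if the "activation" task is still unprocessed after timeoutMs.
  static boolean stalls(int workerThreads, int pendingEvents, long timeoutMs) {
    ExecutorService eventPool = Executors.newFixedThreadPool(workerThreads);
    CompletableFuture<Void> active = new CompletableFuture<>();
    try {
      // Each "pending" handler blocks its worker thread until activation happens.
      for (int i = 0; i < pendingEvents; i++) {
        eventPool.execute(() -> active.join());
      }
      // The activation task is submitted last, to the same pool.
      eventPool.execute(() -> active.complete(null));
      try {
        active.get(timeoutMs, TimeUnit.MILLISECONDS);
        return false;
      } catch (TimeoutException e) {
        return true; // every worker is blocked, so activation never ran
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      }
    } finally {
      active.complete(null); // release any blocked handlers so the pool can exit
      eventPool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("2 threads, 2 blocking handlers: stalls=" + stalls(2, 2, 200));
    System.out.println("2 threads, 1 blocking handler:  stalls=" + stalls(2, 1, 2000));
  }
}
```

With as many blocking handlers as worker threads, the completing task never
leaves the queue. With one fewer, a free worker picks it up and the blocked
handler is released.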
This issue was exacerbated by
[f2755e1|https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01],
which has the subtle effect of not using
{{GatingDelayExecutor#closeDuring()}}. That method would enqueue all these
events until storage recovery is complete. The on-demand execution greatly
increases the likelihood of the above event sequence, since driver
registration begins strictly after storage recovery completes.
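To illustrate the enqueue-until-done behavior described above, here is a
minimal sketch of an executor that holds submissions while a critical section
runs and releases them afterwards. It is a stand-in under stated assumptions,
not the implementation of {{GatingDelayExecutor}}: the class
{{GatedExecutorSketch}} and the method {{holdWhile}} are hypothetical, and a
same-thread delegate is used only to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Executor;

// Hypothetical gate; Aurora's real GatingDelayExecutor is not reproduced here.
class GatedExecutorSketch implements Executor {
  private final Executor delegate;
  private final Queue<Runnable> held = new ArrayDeque<>();
  private boolean open = true;

  GatedExecutorSketch(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized void execute(Runnable work) {
    if (open) {
      delegate.execute(work); // gate open: run on demand
    } else {
      held.add(work); // gate closed: enqueue for later
    }
  }

  // Closes the gate while `critical` runs, then flushes held work in order.
  void holdWhile(Runnable critical) {
    synchronized (this) {
      open = false;
    }
    try {
      critical.run();
    } finally {
      synchronized (this) {
        open = true;
        Runnable next;
        while ((next = held.poll()) != null) {
          delegate.execute(next);
        }
      }
    }
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    GatedExecutorSketch gate = new GatedExecutorSketch(Runnable::run);
    gate.holdWhile(() -> {
      gate.execute(() -> log.add("event handled")); // submitted mid-recovery
      log.add("recovery finished");
    });
    System.out.println(log);
  }
}
```

In this sketch the event submitted during the critical section is handled
only after that section finishes, whereas an ungated executor would run it
on demand.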