Bill Farner created AURORA-909:
----------------------------------
Summary: Make task scheduling more efficient
Key: AURORA-909
URL: https://issues.apache.org/jira/browse/AURORA-909
Project: Aurora
Issue Type: Story
Components: Scheduler
Reporter: Bill Farner
We're making a decent effort at reducing the _cost_ of task scheduling
operations, but have not yet invested in reducing the working set in a way
that causes task scheduling to scale better. Each scheduling attempt for each
task is an O(n) operation, where n is the number of offers.
I would like to explore optimizations where we try to reduce the amount of
redundant work performed in task scheduling. Say, for example, we're trying to
schedule a task that needs 2 CPUs, and we only have offers with 1 CPU. Each
scheduling round will re-assess every offer, despite the fact that the offers
have not changed shape, and will always be a mismatch (hereafter termed
_static_ mismatches). Instead, we should try to skip over offers that are a
static mismatch. We could do this at the {{TaskGroup}} level, since every
element in a task group is by definition statically equivalent. This means
that jobs with a large number of instances could be scheduled very efficiently,
since the first task scheduling round could identify static mismatches,
reducing the working set in the next round.
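A minimal sketch of the static-mismatch idea, written in Python for brevity. The names here ({{Offer}}, {{TaskGroup}} as a plain record, {{StaticMismatchCache}}, {{find_offer}}) are hypothetical and are not the scheduler's actual API, and CPU is the only resource considered:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Offer:
    offer_id: str
    cpus: float


@dataclass(frozen=True)
class TaskGroup:
    # Hypothetical stand-in: all tasks in a group share the same requirements.
    group_key: str
    cpus_needed: float


class StaticMismatchCache:
    """Remembers (task group, offer) pairs that were found to be static mismatches."""

    def __init__(self) -> None:
        self._banned: Dict[str, Set[str]] = {}
        self.evaluations = 0  # number of offers actually assessed

    def find_offer(self, group: TaskGroup, offers: List[Offer]) -> Optional[Offer]:
        banned = self._banned.setdefault(group.group_key, set())
        for offer in offers:
            if offer.offer_id in banned:
                continue  # known static mismatch, skip without re-assessing
            self.evaluations += 1
            if offer.cpus >= group.cpus_needed:
                return offer
            # The offer's shape cannot change, so this mismatch is permanent.
            banned.add(offer.offer_id)
        return None

    def offer_removed(self, offer_id: str) -> None:
        # Forget the offer once it is rescinded or consumed.
        for banned in self._banned.values():
            banned.discard(offer_id)


offers = [Offer("o1", 1.0), Offer("o2", 1.0)]
group = TaskGroup("role/env/job", 2.0)
cache = StaticMismatchCache()
first = cache.find_offer(group, offers)   # assesses both offers
second = cache.find_offer(group, offers)  # assesses none
```

In this sketch the second round does no per-offer work for the group, which is the effect described above for jobs with many instances.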
This is in contrast to _dynamic_ mismatches, where a change in the tasks on a
machine or other settings could make a previously-ineligible offer become a
match. The current sources of dynamic mismatches are limit constraints, host
maintenance modes, and dedicated attributes.
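To illustrate the distinction, here is a rough Python sketch of how mismatch reasons might be classified. The {{Veto}} enum and its members are hypothetical names chosen for illustration; only the three dynamic sources come from the list above, and {{INSUFFICIENT_CPU}} stands in for any static reason:

```python
from enum import Enum
from typing import Iterable


class Veto(Enum):
    INSUFFICIENT_CPU = "insufficient_cpu"    # static: offer shape does not change
    LIMIT_CONSTRAINT = "limit_constraint"    # dynamic
    MAINTENANCE = "maintenance"              # dynamic
    DEDICATED = "dedicated"                  # dynamic


DYNAMIC_VETOES = frozenset(
    {Veto.LIMIT_CONSTRAINT, Veto.MAINTENANCE, Veto.DEDICATED}
)


def is_static(veto: Veto) -> bool:
    return veto not in DYNAMIC_VETOES


def is_static_mismatch(vetoes: Iterable[Veto]) -> bool:
    # One static veto is enough: the offer can never match this task group,
    # regardless of whether the dynamic vetoes later clear.
    return any(is_static(v) for v in vetoes)
```

Under this assumption, an offer vetoed only for dynamic reasons must still be re-evaluated in later rounds.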
I propose we proceed in several steps, re-evaluating after each:
1. instrument the scheduler to better estimate the improvements
2. avoid future (offer, task group) evaluations when static mismatches are found
3. avoid future (offer, task group) evaluations when dynamic mismatches are
found
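For step 1, one possible form of instrumentation is to count how often the same (offer, task group) pair is re-evaluated and mismatches again. The Python sketch below is only an illustration; {{EvaluationStats}} and its methods are hypothetical names, and it does not distinguish static from dynamic mismatches, so it gives an upper bound on the work steps 2 and 3 could save:

```python
from typing import Set, Tuple


class EvaluationStats:
    """Tracks how many evaluations repeat an earlier mismatch for the same pair."""

    def __init__(self) -> None:
        self.total = 0
        self.repeat_mismatches = 0
        self._mismatched: Set[Tuple[str, str]] = set()

    def record(self, group_key: str, offer_id: str, matched: bool) -> None:
        self.total += 1
        pair = (group_key, offer_id)
        if matched:
            self._mismatched.discard(pair)
        elif pair in self._mismatched:
            self.repeat_mismatches += 1  # same pair mismatched again
        else:
            self._mismatched.add(pair)

    def redundant_fraction(self) -> float:
        # Share of all evaluations that repeated a known mismatch.
        return self.repeat_mismatches / self.total if self.total else 0.0


stats = EvaluationStats()
for _ in range(3):
    stats.record("role/env/job", "o1", matched=False)
```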
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)