GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/928
DRILL-5716: Queue-driven memory allocation
Please see [DRILL-5716](https://issues.apache.org/jira/browse/DRILL-5716)
for an overview.
This PR provides five commits that implement a method of allocating memory
to queries based on ZK-based queues. The key concept is that the system has a
fixed amount of memory. Suppose all queries need the same amount of memory. In
that case, if we want to run n queries, each gets an amount that is the total
memory divided by n.
This raises two questions:
1. Queries don't need the same amount of memory, so how do we handle this?
2. How do we know how many queries, n, are to run within Drill?
The solution is to leverage an existing feature: the Zookeeper (ZK) based
queueing mechanism. This mechanism limits (throttles) the number of active
queries. If we set the limit to n, then we know that Drill won't run more than
n queries, and we can apply the above math.
But, not all queries are the same size. The ZK-based throttling mechanism
addressed this by allowing *two* queues: a "large query" queue and a "small
query" queue. The idea is that the cluster might run a single large ETL query,
but at the same time, run five or 10 small interactive queries. The numbers are
configurable via system options. The existing mechanism uses query cost to
decide when a query is "small" vs. "large." The user sets a threshold in terms
of cost. Queries split into the two queues accordingly.
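The cost-based split can be sketched as a one-line decision. The option name `exec.queue.threshold` comes from this PR; the class and method names below are illustrative, not Drill's actual API.

```java
// Hypothetical sketch of queue selection by planner cost.
// Only the threshold rule itself comes from the PR description.
public class QueueSelector {
  private final double costThreshold; // value of exec.queue.threshold

  public QueueSelector(double costThreshold) {
    this.costThreshold = costThreshold;
  }

  /** Queries at or above the cost threshold go to the "large" queue. */
  public String selectQueue(double queryCost) {
    return queryCost >= costThreshold ? "large" : "small";
  }
}
```

A single large ETL query and several interactive queries would thus land in different queues purely by their planner cost estimate.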
### Resource Manager Layer
The first commit introduces a new resource management layer with a number
of concepts:
* A pluggable, global resource manager configured at boot time. The code
offers three versions: the null manager (no throttling), the ZK-based one
discussed above, and a test-only one used in unit tests that uses a simple
in-process queue.
* A per-query resource allocator created by the resource manager to work
with a single query through the query lifecycle.
Since the default manager does no queueing, queues themselves are internal
to the resource manager. The model assumes:
* A query queue that manages the Drillbit's view into the distributed queue
state.
* A queue "lease" that represents a grant of a single slot within a queue
to run a query.
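The abstractions above can be sketched roughly as the interfaces below. All names are illustrative guesses at the shape of the layer, not Drill's actual classes; the "null" implementation mirrors the no-throttling default described above.

```java
// Illustrative sketch of the resource-management abstractions.
interface QueueLease {
  String queueName();          // e.g. "small" or "large"
  long queryMemoryPerNode();
  void release();              // free the slot when the query completes
}

interface QueryResourceAllocator {
  /** Blocks until a queue slot is granted (or the wait times out). */
  QueueLease admit(double queryCost) throws InterruptedException;
}

interface ResourceManager {
  QueryResourceAllocator newQueryAllocator();
}

/** The "null" manager: no throttling, every query is admitted at once. */
class NullResourceManager implements ResourceManager {
  public QueryResourceAllocator newQueryAllocator() {
    return cost -> new QueueLease() {
      public String queueName() { return "none"; }
      public long queryMemoryPerNode() { return Long.MAX_VALUE; }
      public void release() { }
    };
  }
}
```

The ZK-based manager and the test-only in-process manager would plug in behind the same interfaces.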
The design handles the current resource managers, and allows creating
custom schedulers as needed for specific distributions or applications.
Prior to this PR, the Foreman worked with the ZK-based queues directly
inline in the Foreman code. With the above framework in place, this PR
refactors the Foreman to extract the ZK-based queue code into a resource
manager implementation. Functionality is identical, just the location of the
code moves to allow a cleaner design.
One interesting design issue is worth pointing out. The user can enable and
disable ZK-queues at run time. However, the resource manager is global. To make
this work, a "dynamic" resource manager wraps the ZK-based implementation.
The dynamic version periodically checks system options to see if the ZK-based
version is enabled. The dynamic version calls to either the ZK-based version or
the default (non-throttled) version depending on what it finds in the system
options.
When a switch occurs, queries already queued will stay queued. Only new
queries use the new setting. This provides a graceful transition to/from
throttled mode.
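The dynamic wrapper amounts to a per-query delegation decision. The sketch below is self-contained and illustrative; in Drill, the boolean check would read the `exec.queue.enable` option and the delegates would be the ZK-based and default managers.

```java
import java.util.function.BooleanSupplier;

/** Minimal stand-in for a resource-manager interface (illustrative). */
interface Rm {
  String name();
}

/** Dynamic wrapper: picks the delegate per new query based on an option. */
class DynamicRm {
  private final Rm throttled;     // would be the ZK-based manager
  private final Rm unthrottled;   // would be the default manager
  private final BooleanSupplier queuesEnabled; // would read exec.queue.enable

  DynamicRm(Rm throttled, Rm unthrottled, BooleanSupplier queuesEnabled) {
    this.throttled = throttled;
    this.unthrottled = unthrottled;
    this.queuesEnabled = queuesEnabled;
  }

  /** Called once per new query; already-queued queries keep their manager. */
  Rm select() {
    return queuesEnabled.getAsBoolean() ? throttled : unthrottled;
  }
}
```

Because the delegate is chosen only when a query arrives, toggling the option never disturbs queries that are already queued, which is what gives the graceful transition.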
### Specifics of Memory Planning
The core of this change is a new memory planner. Here's how it works.
* The user enables queues using the existing `exec.queue.enable` session
option.
* The user decides the number of "small" vs. "large" queries to run
concurrently by setting `exec.queue.small` and `exec.queue.large` respectively.
(The original numbers for these settings were far too high; this PR changes
the defaults to much lower values.)
* Decide how to split memory across the two queues by specifying the
`exec.queue.memory_ratio` value. The default is 10, which means that large
queries get 10 units of memory while small queries get 1 unit.
* Decide on the planner cost threshold that separates "small" vs. "large"
queries using `exec.queue.threshold`.
* Decide on the maximum queue wait time by setting
`exec.queue.timeout_millis`. (Default is five minutes.)
The memory planner then does the following:
* Determine the number of "memory units" as the number of concurrent
small queries + (the number of concurrent large queries × the memory ratio).
* Suppose this is a small query. Compute actual memory as system memory /
number of memory units.
* Traverse the plan. Find the buffering operators grouped by node.
(Buffering operators, in this release, are sort and hash agg.)
* Suppose we find that, on node X, we have two sorts and a hash agg. This
is three operators on that node.
* Determine the number of minor fragments running on that node. Suppose
there are 10.
* Divide the per-node memory across the operators on each node and each
minor fragment. Here, we divide our total memory by 3 * 10 = 30 to get the
per-operator memory amount.
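The arithmetic above can be sketched directly. The method and parameter names below are illustrative; only the math comes from the PR description.

```java
// Worked sketch of the queue-based memory math described above.
class QueueMemoryPlanner {
  /** Memory per unit = system memory / (small slots + large slots * ratio). */
  static long unitMemory(long systemMemory, int smallSlots,
                         int largeSlots, int ratio) {
    int units = smallSlots + largeSlots * ratio;
    return systemMemory / units;
  }

  /** Memory for one query: 1 unit if small, `ratio` units if large. */
  static long queryMemory(long systemMemory, int smallSlots, int largeSlots,
                          int ratio, boolean isLarge) {
    long unit = unitMemory(systemMemory, smallSlots, largeSlots, ratio);
    return isLarge ? unit * ratio : unit;
  }

  /** Per-operator share on one node: query memory divided across all
      buffering operators in all minor fragments on that node. */
  static long operatorMemory(long queryMemory, int bufferingOps,
                             int minorFragments) {
    return queryMemory / ((long) bufferingOps * minorFragments);
  }
}
```

With, say, 40,000 MB of system memory, 10 small slots, 1 large slot, and the default ratio of 10, each unit is 2,000 MB; a small query's 2,000 MB on a node with 3 buffering operators and 10 minor fragments yields roughly 66 MB per operator.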
The above is more complex than the existing memory allocation utilities.
The extra behavior handles a special case seen in several production
environments: the case in which a query runs only in the root fragment on a
single node, so that there is only a single minor fragment. The existing code
still divides memory by the potential number of fragments, even when the
actual number is 1. The new code handles this case gracefully, giving all
memory to the single fragment.
One refinement that is still needed is to specify a "reserve" (of, say,
10-20%) to account for operators that decline to limit their memory usage.
### Planner and Foreman Changes
Memory planning now depends on the resource manager. The default manager
assigns memory using the existing `MemoryAllocationUtilities`. The ZK-based
version uses the math outlined above.
As a result, the third commit reworks the parallelizers to shift memory
planning into a separate memory planning stage managed by the resource manager.
Consequently, creation of the JSON form of the query plan must shift until
after memory allocation is done.
We assume that some future resource manager may learn of available memory
only after a queue lease is granted. So, we shift memory planning until after
a queue slot is available but before fragments are launched.
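The resulting ordering of Foreman steps can be summarized as a sequence. The step names below are illustrative; only the ordering reflects this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reworked query-launch ordering described above.
class ForemanFlow {
  final List<String> steps = new ArrayList<>();

  void runQuery() {
    steps.add("parallelize");    // build the fragment tree, no memory assigned
    steps.add("admit");          // block for a queue slot (lease)
    steps.add("planMemory");     // memory may be known only after the lease
    steps.add("serializePlan");  // JSON deferred until after memory planning
    steps.add("launchFragments");
  }
}
```

The key constraints are that memory planning follows queue admission, and plan serialization follows memory planning.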
### Web UI Enhancements
Early testing revealed that a major limitation of the existing ZK-based
queues is that they are opaque: the user can't tell what is happening. To
address this, two UI changes were made:
* If ZK-based queues are enabled, information about them now appears on the
main Web UI page for the Drillbit.
* The main query profile page shows, for each query, the query cost (used
to select the "small" vs. "large" queue) and the name of the selected queue:
"small" or "large."
In addition, system/session options are now sorted by name to make it
easier to find the queue-related options.
The above required changes to the query profile definition Protobuf (in
commit 2) as well as changes in the various "models" and "templates" used by
the Web UI. Note that the new information also automatically appears in the
corresponding JSON REST API calls.
### Code Cleanup
Commit 5 has a large number of code cleanup items found while doing the
above work. These do not affect functionality.
### Limitations and Future Improvements
The change in this PR is far from perfect, but it is the best we can do
given the limitations of the existing ZK-based queues. In particular:
* There is no way to determine the total queue length: the number of
queries waiting in the queue.
* There is no good way to cancel a query stuck in the queue.
* JDBC/ODBC clients won't know that their queries are queued, or be able
to cancel queued queries.
* The two-queue model is highly simplistic; real applications need a
variety of queues and the means to categorize queries into queues.
* The ZK-queues allow no prioritization.
* The queueing mechanism does not limit CPU use, only memory use.
* The planner number used to split queries into queues is not very accurate
or intuitive.
* No provision is made to understand that some queries need little memory,
others need a huge amount. Instead, all queries are either "small" or
"large."
All that said, the present PR is a (small) step in the right direction and
lays the groundwork for additional, future efforts.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-5716
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/928.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #928
----
commit db69cbea57d1460a25463d09784ff850b2030f5f
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:32:17Z
DRILL-5716: Queue-driven memory allocation
Commit 1: Core resource management and query queue abstractions.
commit 5dd7e83db122867368b23f0a878c540e212e3a5c
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:33:07Z
Commit 2: RPC changes
Adds queue information to the Protobuf layer.
commit ba43570023d18b3c76256c0cc1727e16dbcd5d1f
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:36:23Z
Commit 3: Foreman and Planner changes
Abstracts memory management out to the new resource management layer.
This means deferring generating the physical plan JSON to later in the
process after memory planning.
commit d924527ebdeda34ef9bd575e30b77e753654d75b
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:38:17Z
Commit 4: Web UI changes
Adds queue information to the main page and the profile page to each
query.
commit edacd1f4623aab8acf19d6e9ebfb9d2627241ab3
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:38:48Z
Commit 5: Other changes
Code cleanup. Also sorts the list of options displayed in the Web UI.
----