GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/928
DRILL-5716: Queue-driven memory allocation
Please see [DRILL-5716](https://issues.apache.org/jira/browse/DRILL-5716)
for an overview.
This PR provides five commits that implement a method of allocating memory
to queries based on ZK-based queues. The key concept is that the system has a
fixed amount of memory. Suppose all queries need the same amount of memory. In
that case, if we want to run n queries, each gets an amount that is the total
memory divided by n.
This raises two questions:
1. Queries don't need the same amount of memory, so how do we handle this?
2. How do we know how many queries, n, are to run within Drill?
The solution is to leverage an existing feature: the Zookeeper (ZK) based
queueing mechanism. This mechanism limits (throttles) the number of active
queries. If we set the limit to n, then we know that Drill won't run more than
n queries, and we can apply the above math.
But, not all queries are the same size. The ZK-based throttling mechanism
addressed this by allowing *two* queues: a "large query" queue and a "small
query" queue. The idea is that the cluster might run a single large ETL query,
but at the same time, run five or 10 small interactive queries. The numbers are
configurable via system options. The existing mechanism uses query cost to
decide when a query is "small" vs. "large." The user sets a threshold in terms
of cost. Queries split into the two queues accordingly.
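The cost-based split can be sketched as a one-line decision. The option name `exec.queue.threshold` comes from this PR; the class and method names below are illustrative, not Drill's actual API.

```java
// Hypothetical sketch of queue selection by planner cost.
// Only the threshold rule itself comes from the PR description.
public class QueueSelector {
  private final double costThreshold; // value of exec.queue.threshold

  public QueueSelector(double costThreshold) {
    this.costThreshold = costThreshold;
  }

  /** Queries at or above the cost threshold go to the "large" queue. */
  public String selectQueue(double queryCost) {
    return queryCost >= costThreshold ? "large" : "small";
  }
}
```

A single large ETL query and several interactive queries would thus land in different queues purely by their planner cost estimate.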
### Resource Manager Layer
The first commit introduces a new resource management layer with a number
of concepts:
* A pluggable, global resource manager configured at boot time. The code
offers three versions: the null manager (no throttling), the ZK-based one
discussed above, and a test-only one used in unit tests that uses a simple
in-process queue.
* A per-query resource allocator created by the resource manager to work
with a single query through the query lifecycle.
Since the default manager does no queueing, queues themselves are internal
to the resource manager. The model assumes:
* A query queue that manages the Drillbit's view into the distributed queue
state.
* A queue "lease" that represents a grant of a single slot within a queue
to run a query.
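The abstractions above can be sketched roughly as the interfaces below. All names are illustrative guesses at the shape of the layer, not Drill's actual classes; the "null" implementation mirrors the no-throttling default described above.

```java
// Illustrative sketch of the resource-management abstractions.
interface QueueLease {
  String queueName();          // e.g. "small" or "large"
  long queryMemoryPerNode();
  void release();              // free the slot when the query completes
}

interface QueryResourceAllocator {
  /** Blocks until a queue slot is granted (or the wait times out). */
  QueueLease admit(double queryCost) throws InterruptedException;
}

interface ResourceManager {
  QueryResourceAllocator newQueryAllocator();
}

/** The "null" manager: no throttling, every query is admitted at once. */
class NullResourceManager implements ResourceManager {
  public QueryResourceAllocator newQueryAllocator() {
    return cost -> new QueueLease() {
      public String queueName() { return "none"; }
      public long queryMemoryPerNode() { return Long.MAX_VALUE; }
      public void release() { }
    };
  }
}
```

The ZK-based manager and the test-only in-process manager would plug in behind the same interfaces.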
The design handles the current resource managers, and allows creating
custom schedulers as needed for specific distributions or applications.
Prior to this PR, the Foreman worked with the ZK-based queues directly
inline in the Foreman code. With the above framework in place, this PR
refactors the Foreman to extract the ZK-based queue code into a resource
manager implementation. Functionality is identical, just the location of the
code moves to allow a cleaner design.
One interesting design issue is worth pointing out. The user can enable and
disable ZK-queues at run time. However, the resource manager is global. To make
this work, a "dynamic" resource manager wraps the ZK-based implementation.
The dynamic version periodically checks system options to see if the ZK-based
version is enabled. The dynamic version calls to either the ZK-based version or
the default (non-throttled) version depending on what it finds in the system
options.
When a switch occurs, queries already queued will stay queued. Only new
queries use the new setting. This provides a graceful transition to/from
throttled mode.
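The dynamic wrapper amounts to a per-query delegation decision. The sketch below is self-contained and illustrative; in Drill, the boolean check would read the `exec.queue.enable` option and the delegates would be the ZK-based and default managers.

```java
import java.util.function.BooleanSupplier;

/** Minimal stand-in for a resource-manager interface (illustrative). */
interface Rm {
  String name();
}

/** Dynamic wrapper: picks the delegate per new query based on an option. */
class DynamicRm {
  private final Rm throttled;     // would be the ZK-based manager
  private final Rm unthrottled;   // would be the default manager
  private final BooleanSupplier queuesEnabled; // would read exec.queue.enable

  DynamicRm(Rm throttled, Rm unthrottled, BooleanSupplier queuesEnabled) {
    this.throttled = throttled;
    this.unthrottled = unthrottled;
    this.queuesEnabled = queuesEnabled;
  }

  /** Called once per new query; already-queued queries keep their manager. */
  Rm select() {
    return queuesEnabled.getAsBoolean() ? throttled : unthrottled;
  }
}
```

Because the delegate is chosen only when a query arrives, toggling the option never disturbs queries that are already queued, which is what gives the graceful transition.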
### Specifics of Memory Planning
The core of this change is a new memory planner. Here's how it works.
* The user enables queues using the existing `exec.queue.enable` session
option.
* The user decides the number of "small" vs. "large" queries to run
concurrently by setting `exec.queue.small` and `exec.queue.large` respectively.
(The original numbers for these settings were far too high; this PR changes
the defaults to much lower values.)
* Decide how to split memory across the two queues by specifying the
`exec.queue.memory_ratio` value. The default is 10, which means that large
queries get 10 units of memory while small queries get 1 unit.
* Decide on the planner cost threshold that separates "small" vs. "large"
queries using `exec.queue.threshold`.
* Decide on the maximum queue wait time by setting
`exec.queue.timeout_millis`. (Default is five minutes.)
The memory planner then does the following:
* Determine the number of "memory units" as the number of concurrent
small queries + (the number of concurrent large queries × the memory ratio).
* Suppose this is a small query. Compute actual memory as system memory /
number of memory units.
* Traverse the plan. Find the buffering operators grouped by node.
(Buffering operators, in this release, are sort and hash agg.)
* Suppose we find that, on node X, we have two sorts and a hash agg. This
is three operators on that node.
* Determine the number of minor fragments running on that node. Suppose
there are 10.
* Divide the per-node memory across the operators on each node and each
minor fragment. Here, we divide our total memory by 3 * 10 = 30 to get the
per-operator memory amount.
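The arithmetic above can be sketched directly. The method and parameter names below are illustrative; only the math comes from the PR description.

```java
// Worked sketch of the queue-based memory math described above.
class QueueMemoryPlanner {
  /** Memory per unit = system memory / (small slots + large slots * ratio). */
  static long unitMemory(long systemMemory, int smallSlots,
                         int largeSlots, int ratio) {
    int units = smallSlots + largeSlots * ratio;
    return systemMemory / units;
  }

  /** Memory for one query: 1 unit if small, `ratio` units if large. */
  static long queryMemory(long systemMemory, int smallSlots, int largeSlots,
                          int ratio, boolean isLarge) {
    long unit = unitMemory(systemMemory, smallSlots, largeSlots, ratio);
    return isLarge ? unit * ratio : unit;
  }

  /** Per-operator share on one node: query memory divided across all
      buffering operators in all minor fragments on that node. */
  static long operatorMemory(long queryMemory, int bufferingOps,
                             int minorFragments) {
    return queryMemory / ((long) bufferingOps * minorFragments);
  }
}
```

With, say, 40,000 MB of system memory, 10 small slots, 1 large slot, and the default ratio of 10, each unit is 2,000 MB; a small query's 2,000 MB on a node with 3 buffering operators and 10 minor fragments yields roughly 66 MB per operator.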
The above is more complex than the existing memory allocation utilities.
The extra behavior handles a special case seen in several production
environments: the case in which a query runs only in the root fragment on a
single node, so that there is only a single minor fragment. The existing code
still divides memory by the potential number of fragments, even when the
actual number is 1. The new code handles this case gracefully, giving all
memory to the single fragment.
One refinement that is still needed is to specify a "reserve" (of, say,
10-20%) to account for operators that decline to limit their memory usage.
### Planner and Foreman Changes
Memory planning now depends on the resource manager. The default manager
assigns memory using the existing `MemoryAllocationUtilities`. The ZK-based
version uses the math outlined above.
As a result, the third commit reworks the parallelizers to shift memory
planning into a separate memory planning stage managed by the resource manager.
Consequently, creation of the JSON form of the query plan must shift until
after memory allocation is done.
We assume that some future resource manager may learn of available memory
only after a queue lease is granted. So, we shift memory planning until after
a queue slot is available but before fragments are launched.
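The resulting ordering of Foreman steps can be summarized as a sequence. The step names below are illustrative; only the ordering reflects this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the reworked query-launch ordering described above.
class ForemanFlow {
  final List<String> steps = new ArrayList<>();

  void runQuery() {
    steps.add("parallelize");    // build the fragment tree, no memory assigned
    steps.add("admit");          // block for a queue slot (lease)
    steps.add("planMemory");     // memory may be known only after the lease
    steps.add("serializePlan");  // JSON deferred until after memory planning
    steps.add("launchFragments");
  }
}
```

The key constraints are that memory planning follows queue admission, and plan serialization follows memory planning.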
### Web UI Enhancements
Early testing revealed that a major limitation of the existing ZK-based
queues is that they are opaque: the user can't tell what is happening. To
address this, two UI changes were made:
* If ZK-based queues are enabled, information about them now appears on the
main Web UI page for the Drillbit.
* The main query profile page shows, for each query, the query cost (used
to select the "small" vs. "large" queue) and the name of the selected queue:
"small" or "large."
In addition, system/session options are now sorted by name to make it
easier to find the queue-related options.
The above required changes to the query profile definition Protobuf (in
commit 2) as well as changes in the various "models" and "templates" used by
the Web UI. Note that the new information also automatically appears in the
corresponding JSON REST API calls.
### Code Cleanup
Commit 5 has a large number of code cleanup items found while doing the
above work. These do not affect functionality.
### Limitations and Future Improvements
The change in this PR is far from perfect, but it is the best we can do
given the limitations of the existing ZK-based queues. In particular:
* There is no way to determine the total queue length: the number of
queries waiting in the queue.
* There is no good way to cancel a query stuck in the queue.
* JDBC/ODBC clients won't know that their queries are queued, or be able
to cancel queued queries.
* The two-queue model is highly simplistic; real applications need a
variety of queues and the means to categorize queries into queues.
* The ZK-queues allow no prioritization.
* The queueing mechanism does not limit CPU use, only memory use.
* The planner number used to split queries into queues is not very accurate
or intuitive.
* No provision is made to understand that some queries need little memory,
others need a huge amount. Instead, all queries are either "small" or
"large."
All that said, the present PR is a (small) step in the right direction and
lays the groundwork for additional, future efforts.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-5716
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/928.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #928
----
commit db69cbea57d1460a25463d09784ff850b2030f5f
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:32:17Z
DRILL-5716: Queue-driven memory allocation
Commit 1: Core resource management and query queue abstractions.
commit 5dd7e83db122867368b23f0a878c540e212e3a5c
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:33:07Z
Commit 2: RPC changes
Adds queue information to the Protobuf layer.
commit ba43570023d18b3c76256c0cc1727e16dbcd5d1f
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:36:23Z
Commit 3: Foreman and Planner changes
Abstracts memory management out to the new resource management layer.
This means deferring generating the physical plan JSON to later in the
process after memory planning.
commit d924527ebdeda34ef9bd575e30b77e753654d75b
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:38:17Z
Commit 4: Web UI changes
Adds queue information to the main page and the profile page to each
query.
commit edacd1f4623aab8acf19d6e9ebfb9d2627241ab3
Author: Paul Rogers <[email protected]>
Date: 2017-08-30T21:38:48Z
Commit 5: Other changes
Code cleanup. Also sorts the list of options displayed in the Web UI.
----