This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch gh-pages in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
     new 896d4dd  edits
896d4dd is described below

commit 896d4ddfd7b7b25fcee8bc75a2a0f8e185c1ca0f
Author: Bridget Bevens <bbev...@maprtech.com>
AuthorDate: Wed Sep 5 18:45:57 2018 -0700

    edits
---
 ...-and-hash-based-memory-constrained-operators.md | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md b/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
index c7e658b..8047638 100644
--- a/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
+++ b/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
@@ -1,6 +1,6 @@
 ---
 title: "Sort-Based and Hash-Based Memory-Constrained Operators"
-date: 2018-09-06 01:31:07 UTC
+date: 2018-09-06 01:45:58 UTC
 parent: "Query Plans and Tuning"
 ---
@@ -11,12 +11,9 @@ Drill supports the following memory-intensive operators, which can temporarily s
 - Hash-Aggregate
 Drill only uses the External Sort operator to sort data. Drill uses the Hash-Aggregate operator to aggregate data. Alternatively, Drill can sort the data and then use the (lightweight) Streaming-Aggregate operator to aggregate data.
-Drill uses the Hash-Join operator to join data. Alternatively, Drill can use the Nested-Loop-Join or sort the data and then use the (lightweight) Merge-Join. Drill typically uses Hash operators for joining and aggregation, as they perform better than the Sort operator (Hash - O(N) vs. Sort - O(N * log(N))). However, if the Hash operators are disabled, or the data is already sorted, Drill uses the alternative methods previously described.
-
-The memory configuration in Drill is specified as the memory limit per-query, per-node. The allocated memory is equally divided among all instances of the spillable operators (per query on each node). The number of instances is the number of spillable operators in the query plan multiplied by the maximal degree of parallelism. The maximal degree of parallelism is the number of minor fragments required to perform the work for each instance of a spillable operator. When an instance of a sp [...]
-
-To see the difference in memory consumption between the operators, run a query and then view the query profile in the Drill Web UI. Optionally, you can disable the Hash operators, which forces Drill to use the Merge-Join and Streaming-Aggregate operators.
+Drill uses the Hash-Join operator to join data. Alternatively, Drill can use the Nested-Loop-Join or sort the data and then use the (lightweight) Merge-Join. Drill typically uses Hash operators for joining and aggregation, as they perform better than the Sort operator (Hash - O(N) vs. Sort - O(N * log(N))). However, if you disable the Hash operators, or the data is already sorted, Drill uses the alternative methods previously described.
+The memory configuration in Drill is specified as the memory limit per-query, per-node. The allocated memory is equally divided among all instances of the spillable operators (per query on each node). The number of instances is the number of spillable operators in the query plan multiplied by the maximal degree of parallelism. The maximal degree of parallelism is the number of minor fragments required to perform the work for each instance of a spillable operator. When an instance of a sp [...]
 ##Spill to Disk
@@ -32,7 +29,7 @@ Ideally, you want to allocate enough memory for Drill to perform all operations
 Spillable operators write data to a temporary work area on disk when they cannot process all of the data in memory. The default location of the temporary work area is `/tmp/drill/spill` on the local file system.
-The `/tmp/drill/spill` directory should suffice for small workloads or examples, however it is highly recommended that you redirect the default spill location to a location with enough disk space to support spilling for large workloads.
+The `/tmp/drill/spill` directory should suffice for small workloads or examples; however, you should redirect the default spill location to a location with enough disk space to support spilling for large workloads.
 **Note:** Spilled data may require more space than the table referenced in the query that is spilling the data. For example, when the underlying table is compressed (Parquet), or when the operator received data joined from multiple tables.
@@ -40,14 +37,15 @@ When you configure the spill location, you can specify a single directory or a l
 **Configuring Spill to Disk**
-The `drill-override.conf` file, located in the `/conf` directory, contains options that set the spill locations for the Hash and Sort operators. An administrator can change the file system and directories into which the operators spill data. Refer to the `drill-override-example.conf` file included in the `/conf` directory for examples.
+The `drill-override.conf` file, located in the `/conf` directory, contains options that set the spill locations for the spillable operators. An administrator can change the file system and directories into which the operators spill data. Refer to the `drill-override-example.conf` file included in the `/conf` directory for examples.
 The following list describes the spill to disk configuration options:
 - **drill.exec.spill.fs**
-Introduced in Drill 1.11. The default file system on the local machine into which the Sort, Hash Aggregate, and Hash Join operators spill data. You can configure this option so that data spills into a distributed file system, such as hdfs. For example, "hdfs:///". The default setting is "file:///".
+Introduced in Drill 1.11. The default file system on the local machine into which the spillable operators spill data. You can configure this option so that data spills into a distributed file system, such as hdfs. For example, "hdfs:///". The default setting is "file:///".
+
 - **drill.exec.spill.directories**
-Introduced in Drill 1.11. The list of directories into which the Sort, Hash Aggregate, and Hash Join operators spill data. The list must be an array with directories separated by a comma, for example ["/fs1/drill/spill" , "/fs2/drill/spill" , "/fs3/drill/spill"]. The default setting is ["/tmp/drill/spill"].
+Introduced in Drill 1.11. The list of directories into which the spillable operators spill data. The list must be an array with directories separated by a comma, for example ["/fs1/drill/spill" , "/fs2/drill/spill" , "/fs3/drill/spill"]. The default setting is ["/tmp/drill/spill"].
 **Note:** The following options were available prior to Drill 1.11, but have since been deprecated and replaced with the options described above:
@@ -58,7 +56,7 @@ Introduced in Drill 1.11. The list of directories into which the Sort, Hash Aggr
 ##Memory Allocation
-Drill evenly splits the available memory among all instances of the spillable operators. When a query is parallelized, the number of operators is multiplied, which reduces the amount of memory given to each instance of the operators during a query.
+Drill evenly splits the available memory among all instances of the spillable operators. When a query is parallelized, the number of operators is multiplied, which reduces the amount of memory given to each instance of the operators during a query. To see the difference in memory consumption between the operators, you can run a query and then view the query profile in the Drill Web UI. Optionally, you can disable the Hash operators, which forces Drill to use the Merge-Join and Streaming- [...]
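Editorial illustration (not part of the commit): the even split described in the Memory Allocation paragraph above can be sketched as plain arithmetic. This is a minimal Python sketch under stated assumptions. The function name and its parameters are hypothetical and are not Drill APIs or options, and Drill's planner may apply further limits or adjustments that this sketch does not model.

```python
def memory_per_operator_instance(query_memory_bytes, spillable_operators, max_parallelism):
    """Hypothetical helper: split a per-query, per-node memory budget evenly.

    Follows the description in the text only:
    instances = spillable operators in the plan * maximal degree of parallelism.
    """
    if spillable_operators <= 0 or max_parallelism <= 0:
        raise ValueError("operator count and parallelism must be positive")
    instances = spillable_operators * max_parallelism
    # Integer division: each instance gets an equal share of the budget.
    return query_memory_bytes // instances


# Assumed example values: a 2 GB budget, 2 spillable operators, parallelism of 4.
share = memory_per_operator_instance(2 * 1024**3, 2, 4)
print(share // 1024**2, "MB per operator instance")
```

With these assumed values there are 8 operator instances, so each receives one eighth of the budget. Doubling the parallelism halves the share again, which is the effect the paragraph describes.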
 **Memory Allocation Configuration Options**
@@ -68,7 +66,7 @@ The `planner.memory.max_query_memory_per_node` and `planner.memory.percent_per_q
 The `planner.memory.max_query_memory_per_node` option is the minimum amount of memory available to Drill per query on a node. The default of 2 GB typically allows between two and three concurrent queries to run when the JVM is configured to use 8 GB of direct memory (default). When the memory requirement for Drill increases, the default of 2 GB is constraining. You must increase the amount of memory for queries to complete, unless the setting for the `planner.memory.percent_per_query` op [...]
 - **planner.memory.percent\_per_query**
-Alternatively, the `planner.memory.percent_per_query` option sets the memory as a percentage of the total direct memory. The default is 5%. This value is only used when throttling is disabled. Setting the value to 0 disables the option. You can increase or decrease the value, however you should set the percentage well below the JVM direct memory to account for the cases where Drill does not manage memory, such as for the less memory intensive operators.
+Alternatively, the `planner.memory.percent_per_query` option sets the memory as a percentage of the total direct memory. The default is 5%. This value is only used when throttling is disabled. Setting the value to 0 disables the option. You can increase or decrease the value; however, you should set the percentage well below the JVM direct memory to account for the cases where Drill does not manage memory, such as for the less memory intensive operators.
 - The percentage is calculated using the following formula: