[drill] branch gh-pages updated: add info about semi-join functionality

bridgetb Tue, 18 Dec 2018 22:01:29 -0800

This is an automated email from the ASF dual-hosted git repository.

bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git



The following commit(s) were added to refs/heads/gh-pages by this push:
     new da8628f  add info about semi-join functionality
da8628f is described below

commit da8628f2a18d4edef60dda7b74dcc0a76cf5c0ef
Author: Bridget Bevens <[email protected]>
AuthorDate: Tue Dec 18 22:00:23 2018 -0800

    add info about semi-join functionality
---
 ...-and-hash-based-memory-constrained-operators.md | 47 ++++++++++++++++++++--
 1 file changed, 43 insertions(+), 4 deletions(-)

diff --git 
a/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
 
b/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
index 8047638..3d45f2e 100644
--- 
a/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
+++ 
b/_docs/performance-tuning/query-plans-and-tuning/050-sort-based-and-hash-based-memory-constrained-operators.md
@@ -1,17 +1,17 @@
 ---
 title: "Sort-Based and Hash-Based Memory-Constrained Operators"
-date: 2018-09-06 01:45:58 UTC
+date: 2018-12-19
 parent: "Query Plans and Tuning"
 ---  
 
 Drill supports the following memory-intensive operators, which can temporarily 
spill data to disk if they run out of memory:  
 
 - External Sort
-- Hash-Join
+- Hash-Join (with semi-join functionality in Drill 1.15 and later)
 - Hash-Aggregate
 
 Drill only uses the External Sort operator to sort data. Drill uses the 
Hash-Aggregate operator to aggregate data. Alternatively, Drill can sort the 
data and then use the (lightweight) Streaming-Aggregate operator to aggregate 
data.
-Drill uses the Hash-Join operator to join data. Alternatively, Drill can use 
the Nested-Loop-Join or sort the data and then use the (lightweight) 
Merge-Join. Drill typically uses Hash operators for joining and aggregation, as 
they perform better than the Sort operator (Hash - O(N) vs. Sort - O(N * 
log(N))). However, if you disable the Hash operators, or the data is already 
sorted, Drill uses the alternative methods previously described.
+Drill uses the Hash-Join operator to join data. Drill 1.15 introduces 
semi-join functionality inside the Hash-Join operator to improve query 
performance. Semi-joins remove the distinct processing below the Hash-Join and 
eliminate the overhead incurred from using a Hash Aggregate. Prior to Drill 
1.15 (or when [semi-join functionality is 
disabled]({{site.baseurl}}/docs/sort-based-and-hash-based-memory-constrained-operators/#disabling-the-hash-operators)),
 Drill uses a Distinct Hash Aggrega [...]
 
 The memory configuration in Drill is specified as the memory limit per-query, 
per-node. The allocated memory is equally divided among all instances of the 
spillable operators (per query on each node). The number of instances is the 
number of spillable operators in the query plan multiplied by the maximal 
degree of parallelism. The maximal degree of parallelism is the number of minor 
fragments required to perform the work for each instance of a spillable 
operator. When an instance of a sp [...]
 
@@ -101,7 +101,46 @@ The following options control the hash-based operators:
 Enables or disables hash aggregation; otherwise, Drill does a sort-based 
aggregation. This option is enabled by default. The default, and recommended, 
setting is true. Prior to Drill 1.11, the Hash Aggregate operator used an 
uncontrolled amount of memory (up to 10 GB), after which the operator ran out 
of memory. As of Drill 1.11, the Hash Aggregate operator can spill to disk.  
 
 - **planner.enable_hashjoin**  
-Enables or disables hash joins. This option is enabled by default. Drill 
assumes that a query will have adequate memory to complete and tries to use the 
fastest operations possible. Prior to Drill 1.14, the Hash-Join operator used 
an uncontrolled amount of memory (up to 10 GB), after which the operator ran 
out of memory. As of Drill 1.14, this operator can spill to disk. This option 
is enabled by default.
+Enables or disables hash joins. This option is enabled by default. Drill 
assumes that a query will have adequate memory to complete and tries to use the 
fastest operations possible. Prior to Drill 1.14, the Hash-Join operator used 
an uncontrolled amount of memory (up to 10 GB), after which the operator ran 
out of memory. As of Drill 1.14, this operator can spill to disk. This option 
is enabled by default.    
+
+- **planner.enable_semijoin**  
+Enables or disables semi-joins. This option is enabled by default and only 
works when the `planner.enable_hashjoin` option is also enabled. When enabled, 
Drill uses semi-joins to remove the distinct processing below the Hash Join and 
sets the semi-join flag in the Hash Join flag, as shown in the following 
example:  
+
+###Example: Query Plan with and without Semi-Join
+
+**Semi-Join Disabled**   
+In the following query plan, you can see the HashAgg before the HashJoin. In 
the HashJoin flag, you can see that semi-join flag is set to false, indicating 
that a semi-join was not used.  
+
+       EXPLAIN PLAN FOR SELECT employee_id, full_name FROM cp.`employee.json` 
WHERE employee_id IN (SELECT employee_id FROM cp.`employee.json`);  
+
+       
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+       |                                       text                            
           |                                                            
+       
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
+       | 00-00    Screen
+       00-01      Project(employee_id=[$0], full_name=[$1])
+       00-02        Project(employee_id=[$0], full_name=[$1])
+       00-03          HashJoin(condition=[=($0, $2)], joinType=[inner], 
semi-join: =[false])
+       00-05            Scan(table=[[cp, employee.json]], 
groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, 
columns=[`employee_id`, `full_name`], files=[classpath:/employee.json]]])
+       00-04            Project(employee_id0=[$0])
+       00-06              HashAgg(group=[{0}])
+       00-07                Scan(table=[[cp, employee.json]], 
groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, 
columns=[`employee_id`], files=[classpath:/employee.json]]])
+       planner.enable_semijoin  
+
+**Semi-Join Enabled**   
+In the following query plan, you can see that the HashAgg is absent. In the 
HashJoin flag, you can see that semi-join flag is set to true, indicating that 
a semi-join was used. Using the semi-join optimizes the query by reducing the 
amount of processing that Drill must perform on data.   
+
+       EXPLAIN PLAN FOR SELECT employee_id, full_name FROM cp.`employee.json` 
WHERE employee_id IN (SELECT employee_id FROM cp.`employee.json`);
+       
--------------------------------------------------------------------------------+
+       |                                       text                            
           |                                     
+----------------------------------------------------------------------------------+
+       | 00-00    Screen
+       00-01      Project(employee_id=[$0], full_name=[$1])
+       00-02        HashJoin(condition=[=($0, $2)], joinType=[inner], 
semi-join: =[true])
+       00-04          Scan(table=[[cp, employee.json]], 
groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, 
columns=[`employee_id`, `full_name`], files=[classpath:/employee.json]]])
+       00-03          Scan(table=[[cp, employee.json]], 
groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json, numFiles=1, 
columns=[`employee_id`], files=[classpath:/employee.json]]])
+
+
+
+

[drill] branch gh-pages updated: add info about semi-join functionality

Reply via email to