Ben-Zvi opened a new pull request #1606: DRILL-6845: Semi-Hash-Join to skip 
incoming build duplicates, automatically stop skipping if too few
URL: https://github.com/apache/drill/pull/1606
 
 
   The first two commits here were extracted from the original PR #1522 
(DRILL-6735), where the Semi-Hash-Join was implemented in a straightforward 
way: read the build side like a regular hash join (i.e., into partitions, 
building the hash tables later), and only at probe time perform at most a 
single probe match.
      The issue with that implementation is the case of excessive incoming 
build-side duplicates (more common with synthetic data in benchmarks). In such 
a case, reading *all* the data first can blow up the hash-join memory (e.g., 
cause spills) and regress performance.
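
For reference, here is a minimal, self-contained sketch (plain Java, not the actual Drill operator; all names are illustrative) of that straightforward approach: the build side is kept as-is, duplicates included, and each probe key stops after its first match.

```java
// Sketch of the "straightforward" semi-join from PR #1522: the build side is
// stored with all its duplicates, and each probe key matches at most once.
import java.util.ArrayList;
import java.util.List;

public class NaiveSemiJoinSketch {
    static List<Integer> semiJoin(List<Integer> buildKeys, List<Integer> probeKeys) {
        List<Integer> output = new ArrayList<>();
        for (Integer probeKey : probeKeys) {
            for (Integer buildKey : buildKeys) {   // stand-in for a hash-table lookup
                if (buildKey.equals(probeKey)) {
                    output.add(probeKey);          // at most a single probe match
                    break;
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // The build side is full of duplicates, yet all of them are read and kept first.
        System.out.println(semiJoin(List.of(1, 1, 1, 2), List.of(1, 2, 3)));  // [1, 2]
    }
}
```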
   
      This PR addresses the problem by creating the hash tables first and 
using them to detect build-side duplicates early (before copying from the 
incoming batch into the partitions), so those duplicates can simply be 
ignored/skipped (see the new method `insertKeyIntoHashTable()`). After all the 
build side is read (if nothing spilled), there is no need to build the hash 
tables, as they already exist - see the new method `buildContainers()`.
      All this logic is in the **first** commit. The drawback of this logic is 
that it adds overhead (e.g., hash-table doubling), which is wasted work when 
there are very few duplicates; that issue is addressed by the second commit. 
(Also note the new option `semi_skip_duplicates`, which can be used to disable 
this whole feature.)
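
A minimal sketch of the idea (not the actual Drill code; the `Set`-based hash table, `readBuildSide`, and the partition list are simplified stand-ins): the key is inserted into a hash table before the row is copied into a partition, and a row whose key is already present is skipped.

```java
// Hedged sketch of the duplicate-skipping build phase from the first commit.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateSkippingBuildSketch {
    /** Loosely mirrors the new insertKeyIntoHashTable(): returns true when the
     *  key is a duplicate (already present), so the caller can skip the row. */
    static boolean insertKeyIntoHashTable(Set<Integer> hashTable, int key) {
        return !hashTable.add(key);
    }

    static List<Integer> readBuildSide(List<Integer> incomingBuildKeys) {
        Set<Integer> hashTable = new HashSet<>();
        List<Integer> partition = new ArrayList<>();   // stand-in for copying into partitions
        for (int key : incomingBuildKeys) {
            if (!insertKeyIntoHashTable(hashTable, key)) {
                partition.add(key);                    // new key: copy the row
            }                                          // duplicate: ignored/skipped
        }
        // If nothing spilled, the hash table already exists at this point, so
        // there is no separate "build hash tables" step (cf. buildContainers()).
        return partition;
    }

    public static void main(String[] args) {
        System.out.println(readBuildSide(List.of(7, 7, 7, 8, 9, 9)));  // [7, 8, 9]
    }
}
```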
   
      The **second** commit gathers some runtime statistics to decide whether 
there are too few duplicates; in that case it drops the hash tables and falls 
back to the simple semi-join behavior (a la PR #1522). The decision uses a 
threshold equal to half the total size of the hash tables (so they won't 
double), and incoming duplicates are counted. Once that many incoming rows 
have been processed, the duplicate percentage is checked: if it is under 20% 
(hard coded), skipping stops; otherwise the hash tables continue to be used to 
eliminate the duplicates.
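
Roughly, the decision amounts to the following hedged sketch (the field names are hypothetical; the half-capacity threshold and the 20% cutoff are the ones described above):

```java
// Hedged sketch of the second commit's "runtime statistics" decision.
public class DuplicateStatsSketch {
    private static final double MIN_DUPLICATE_PERCENT = 20.0;  // hard coded in this PR

    private final long threshold;       // half the total hash-table capacity
    private long rowsSeen;
    private long duplicatesSeen;
    private boolean skippingEnabled = true;

    DuplicateStatsSketch(long totalHashTableCapacity) {
        this.threshold = totalHashTableCapacity / 2;
    }

    /** Called once per incoming build row. */
    void onIncomingRow(boolean wasDuplicate) {
        if (!skippingEnabled) { return; }
        rowsSeen++;
        if (wasDuplicate) { duplicatesSeen++; }
        if (rowsSeen == threshold) {
            double duplicatePercent = 100.0 * duplicatesSeen / rowsSeen;
            if (duplicatePercent < MIN_DUPLICATE_PERCENT) {
                skippingEnabled = false;   // drop the hash tables, fall back to plain semi-join
            }
        }
    }

    boolean isSkippingEnabled() { return skippingEnabled; }

    public static void main(String[] args) {
        DuplicateStatsSketch stats = new DuplicateStatsSketch(8);  // threshold = 4 rows
        int[] dupFlags = {0, 0, 0, 1};                             // 25% duplicates: keep skipping
        for (int f : dupFlags) { stats.onIncomingRow(f == 1); }
        System.out.println(stats.isSkippingEnabled());             // true
    }
}
```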
   
     The **third** commit extends the memory manager to handle this special 
"duplicate skipping" mode, with a new class `HashJoinSpillControlImpl` and a 
new interface `HashJoinMemoryCalculator.HashJoinSpillControl`. The technique 
used for `shouldSpill()` is simply ensuring that the available memory is large 
enough for at least 3 (see below) more batches. That required a change to all 
the `shouldSpill()` methods: adding the `currentVectorContainer` parameter.
       Most of the code changes in HashPartition are a rename (batch -> 
vectorContainer), and in HashJoinBatch the prefix "semi" was added to some 
variable names.
       As for running out of memory while inserting into the hash table 
(either when allocating a new keys batch or when resizing the hash table): the 
hash table throws `RetryAfterSpillException`, which is caught in the new 
`insertKeyIntoHashTable()`. The catch triggers a spill, resets the hash table, 
and returns false (it is a new key - it will be inserted into the new, empty 
hash table). So this case is much simpler than in the Hash-Aggregate.
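
A simplified sketch of that spill rule, assuming a plain byte-counting memory model rather than the real `HashJoinMemoryCalculator` interfaces (names and sizes are illustrative):

```java
// Hedged sketch of the spill-control idea behind HashJoinSpillControlImpl:
// spill when the memory still available cannot hold at least N more batches
// the size of the batch about to be added.
public class SpillControlSketch {
    private final long memoryLimitBytes;
    private long memoryUsedBytes;
    private final int minBatchesInAvailableMemory;   // hard coded to 3 until the fourth commit

    SpillControlSketch(long memoryLimitBytes, int minBatchesInAvailableMemory) {
        this.memoryLimitBytes = memoryLimitBytes;
        this.minBatchesInAvailableMemory = minBatchesInAvailableMemory;
    }

    /** The size parameter plays the role of the currentVectorContainer added to shouldSpill(). */
    boolean shouldSpill(long currentBatchSizeBytes) {
        long available = memoryLimitBytes - memoryUsedBytes;
        return available < minBatchesInAvailableMemory * currentBatchSizeBytes;
    }

    void onBatchAdded(long batchSizeBytes) { memoryUsedBytes += batchSizeBytes; }

    public static void main(String[] args) {
        SpillControlSketch control = new SpillControlSketch(1_000_000, 3);
        control.onBatchAdded(800_000);
        System.out.println(control.shouldSpill(100_000));  // true: room for fewer than 3 batches
    }
}
```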
   
      The **fourth** commit adds an option `min_batches_in_available_memory` 
instead of the hard-coded "3" above. Also added an `IntegerValidator` that can 
specify the min/max values. 
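
A hedged sketch of the kind of min/max validation involved (not the real Drill `IntegerValidator`; the option-name usage and the bounds shown are illustrative):

```java
// Sketch of a range-checked integer option validator.
public class RangeValidatorSketch {
    private final String optionName;
    private final long min;
    private final long max;

    RangeValidatorSketch(String optionName, long min, long max) {
        this.optionName = optionName;
        this.min = min;
        this.max = max;
    }

    void validate(long value) {
        if (value < min || value > max) {
            throw new IllegalArgumentException(
                "Option " + optionName + " must be between " + min + " and " + max);
        }
    }

    public static void main(String[] args) {
        // Bounds below are made up for the example.
        RangeValidatorSketch v = new RangeValidatorSketch("min_batches_in_available_memory", 1, 5);
        v.validate(3);   // ok
        v.validate(99);  // throws IllegalArgumentException
    }
}
```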
   
