Ben-Zvi opened a new pull request #1606: Drill 6845: Semi-Hash-Join to skip incoming build duplicates, automatically stop skipping if too few
URL: https://github.com/apache/drill/pull/1606

The first two commits here were extracted from the original PR #1522 (DRILL-6735), where the Semi-Hash-Join was implemented in a straightforward way: read the data like a regular hash join (i.e., into partitions, then later build hash tables), and only at probe time perform at most a single probe match.

The issue with that implementation is the case of excessive incoming build-side duplicates (more common with synthetic data in benchmarks). In such a case, reading *all* the data first can blow up the hash join's memory (e.g., cause spills) and regress performance.

This PR addresses the problem by creating the hash tables first and using them to detect build duplicates early (before copying from incoming into partitions), so those duplicates can simply be ignored/skipped (see the new method `insertKeyIntoHashTable()`). After the entire build side is read (if nothing spilled), there is no need to build the hash tables, as they already exist - see the new method `buildContainers()`. All this logic is in the **first** commit.

The issue with this logic is that it adds overhead (e.g., hash-table doubling), which is a waste when there are very few duplicates. That issue is addressed by the second commit. (Also note the new option `semi_skip_duplicates`, which can be used to disable this whole feature.)

The **second** commit collects some "runtime statistics" to decide whether there are too few duplicates. In that case, it drops the hash tables and falls back to the simple semi-join work (a la PR #1522). This decision uses a "threshold", which is half the total size of all the hash tables (so they won't double), and incoming duplicates are counted.
After that many incoming rows are processed, the percentage of duplicates is checked: if under 20% (hard-coded), skipping stops; otherwise the hash tables continue to be used to eliminate the duplicates.

The **third** commit extends the memory manager to handle this special "duplicate skipping" mode, with a new class `HashJoinSpillControlImpl` and a new interface `HashJoinMemoryCalculator.HashJoinSpillControl`. The technique used for `shouldSpill()` is simply to ensure that the available memory is large enough for at least 3 (see below) more batches. This required a change to all the `shouldSpill()` methods - adding the `currentVectorContainer` parameter. Most of the code changes in HashPartition were a rename (batch -> vectorContainer), and in HashJoinBatch "semi" was added to some variable names.

As for "running out of memory" while inserting into the hash table (either allocating a new keys batch or resizing the hash table): this is handled by the hash table throwing `RetryAfterSpillException`, which is caught in the new `insertKeyIntoHashTable()`. That leads to a spill, a reset of the hash table anyway, and a return of false (it is a new key - it will be inserted into the new, empty hash table). So this case is much simpler than in Hash-Aggr.

The **fourth** commit adds an option `min_batches_in_available_memory` instead of the above hard-coded "3". Also added is an `IntegerValidator` that can specify min/max values.
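To make the first commit's idea concrete, here is a minimal sketch of duplicate skipping on insert. This is not Drill's actual API: a `HashSet` stands in for the hash table and a list stands in for the partition's batches; the method name mirrors the PR's `insertKeyIntoHashTable()` but everything else is illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only (not Drill code): probe a per-partition key set
// before copying a build row, so duplicate keys are skipped early.
class SemiJoinDedupSketch {
  private final Set<Integer> keys = new HashSet<>();        // stands in for the hash table
  private final List<Integer> partition = new ArrayList<>(); // stands in for the partition data

  // Returns true if the key was new (row copied into the partition),
  // false if the row was a duplicate and was skipped.
  boolean insertKeyIntoHashTable(int key) {
    if (!keys.add(key)) {
      return false; // duplicate: do not copy into the partition
    }
    partition.add(key); // new key: copy the incoming row
    return true;
  }

  int partitionSize() {
    return partition.size();
  }
}
```

Because a semi-join only asks "does at least one match exist", keeping a single copy of each key is sufficient, which is what makes skipping safe.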
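The second commit's runtime-statistics decision can be sketched as follows. The names and signature are hypothetical; only the shape of the rule comes from the PR: keep collecting until the threshold (half the total hash-table size) is reached, then keep skipping only if the observed duplicate rate is at least the hard-coded 20%.

```java
// Illustrative sketch of the "stop skipping if too few duplicates" decision.
class DuplicateSkipDecision {
  static final double MIN_DUP_RATE = 0.20; // hard-coded 20% cutoff from the PR

  // rowsProcessed: incoming build rows seen so far
  // duplicatesFound: rows skipped as duplicates so far
  // threshold: half the total size of all hash tables (so they won't double)
  static boolean keepSkipping(long rowsProcessed, long duplicatesFound, long threshold) {
    if (rowsProcessed < threshold) {
      return true; // still gathering statistics; keep using the hash tables
    }
    return (double) duplicatesFound / rowsProcessed >= MIN_DUP_RATE;
  }
}
```

When this returns false, the hash tables are dropped and the operator falls back to the plain semi-join path of PR #1522, avoiding further hash-table overhead.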
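Finally, the spill check from the third and fourth commits amounts to reserving room for a minimum number of additional batches. The sketch below is again hypothetical (it is not `HashJoinSpillControlImpl`), but it captures the rule: spill when available memory can no longer hold `min_batches_in_available_memory` more batches of the current size.

```java
// Illustrative sketch of the "duplicate skipping" mode's spill control.
class SpillControlSketch {
  private final int minBatchesInAvailableMemory; // configurable in the fourth commit (was a hard-coded 3)

  SpillControlSketch(int minBatchesInAvailableMemory) {
    this.minBatchesInAvailableMemory = minBatchesInAvailableMemory;
  }

  // Spill when the available memory cannot hold the minimum number of
  // additional batches of the current batch size.
  boolean shouldSpill(long availableMemoryBytes, long batchSizeBytes) {
    return availableMemoryBytes < (long) minBatchesInAvailableMemory * batchSizeBytes;
  }
}
```

This is deliberately simpler than the regular hash-join memory calculators: in skipping mode the build side is bounded by the number of distinct keys, so a coarse headroom check suffices.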
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services