[jira] [Created] (DRILL-5293) Poor performance of Hash Table due to same hash value as distribution below

Boaz Ben-Zvi (JIRA) Wed, 22 Feb 2017 15:01:39 -0800

Boaz Ben-Zvi created DRILL-5293:
-----------------------------------

             Summary: Poor performance of Hash Table due to same hash value as 
distribution below
                 Key: DRILL-5293
                 URL: https://issues.apache.org/jira/browse/DRILL-5293
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Codegen
    Affects Versions: 1.8.0
            Reporter: Boaz Ben-Zvi
            Assignee: Boaz Ben-Zvi



The hash value is computed in essentially the same way for the Hash Table 
(used by Hash Agg and Hash Join) and for the distribution of rows at the 
exchange. As a result, a given Hash Table (in a parallel minor fragment) 
receives only the rows that the partitioning below ("upstream") selected for 
that fragment, and the pattern of this selection leads to non-uniform usage of 
the hash buckets in the table.
  Here is a simplified example: an exchange partitions into TWO minor 
fragments, each running a Hash Agg. The partitioner sends rows with EVEN hash 
values to the first and rows with ODD hash values to the second. The first 
then recomputes the _same_ hash value for its Hash Table, so only the even 
buckets get used. (With a partition into EIGHT, possibly only one eighth of 
the buckets would be used.) 

   This leads to longer hash chains and thus to _poor performance_.
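
The effect can be reproduced with a small stand-alone simulation. This is only 
a sketch under stated assumptions, not Drill code: it assumes the exchange 
routes a row by the hash value modulo the number of receivers, that the Hash 
Table has a power-of-two bucket count and indexes by the low bits of the hash, 
and it uses a stand-in mixing function in place of Drill's hash32. The class 
and method names (SameHashDemo, usedBuckets) are hypothetical.

```java
import java.util.BitSet;

public class SameHashDemo {
    // Stand-in for a well-mixed 32-bit hash (assumption; NOT Drill's hash32).
    static int hash(int key) {
        int h = key;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    // Counts the buckets that receive at least one row in one minor fragment
    // when the exchange and the hash table both use the same hash value.
    static int usedBuckets(int numPartitions, int fragment, int numBuckets, int numKeys) {
        BitSet used = new BitSet(numBuckets);
        for (int key = 0; key < numKeys; key++) {
            int h = hash(key);
            // Assumed exchange rule: the row goes to fragment (hash mod numPartitions).
            if (Math.floorMod(h, numPartitions) != fragment) {
                continue;
            }
            // Assumed table rule: bucket index is the low bits of the same hash.
            used.set(h & (numBuckets - 1));
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        int numBuckets = 1 << 16;
        int numKeys = 1 << 22;
        System.out.println("2 partitions, fragment 0: "
                + usedBuckets(2, 0, numBuckets, numKeys) + " of " + numBuckets + " buckets used");
        System.out.println("8 partitions, fragment 0: "
                + usedBuckets(8, 0, numBuckets, numKeys) + " of " + numBuckets + " buckets used");
    }
}
```

Under these assumptions, fragment 0 can never touch more than 1/2 of its 
buckets with two partitions, or more than 1/8 with eight, no matter how many 
distinct keys arrive.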

A possible solution: add a distribution function distFunc (used only for 
partitioning) that takes the hash value and "scrambles" it so that the entropy 
in all the bits affects the low bits of the output. This function should be 
applied (in HashPrelUtil) over the generated code that produces the hash value, 
like:

   distFunc( hash32(field1, hash32(field2, hash32(field3, 0))) );
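
One way such a distFunc might look is sketched below. The issue does not 
specify the scrambling function, so the choice of the MurmurHash3 32-bit 
finalizer here is an assumption for illustration only. The stand-in hash, the 
modulo routing rule, the power-of-two bucket indexing, and the names 
DistFuncSketch and usedBuckets are hypothetical as well.

```java
import java.util.BitSet;

public class DistFuncSketch {
    // Stand-in for the generated hash32(...) chain (assumption; NOT Drill's hash32).
    static int hash(int key) {
        int h = key * 0x9E3779B1;
        h ^= h >>> 15;
        h *= 0x85EBCA77;
        h ^= h >>> 13;
        return h;
    }

    // Hypothetical distFunc: the MurmurHash3 32-bit finalizer, which folds
    // high bits into low bits so every input bit influences the low output bits.
    static int distFunc(int h) {
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    // Counts the buckets used in one minor fragment when the exchange routes
    // by distFunc(hash) while the hash table still indexes by the raw hash.
    static int usedBuckets(int numPartitions, int fragment, int numBuckets, int numKeys) {
        BitSet used = new BitSet(numBuckets);
        for (int key = 0; key < numKeys; key++) {
            int h = hash(key);
            // Assumed exchange rule: route by the scrambled value only.
            if (Math.floorMod(distFunc(h), numPartitions) != fragment) {
                continue;
            }
            // Assumed table rule: bucket index is the low bits of the raw hash.
            used.set(h & (numBuckets - 1));
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        int numBuckets = 1 << 16;
        int numKeys = 1 << 22;
        System.out.println("8 partitions, fragment 0: "
                + usedBuckets(8, 0, numBuckets, numKeys) + " of " + numBuckets + " buckets used");
    }
}
```

Because the routing decision no longer shares its low bits with the bucket 
index, each fragment should see rows spread over nearly all of its buckets 
rather than a fixed fraction of them.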

Tested with a huge hash aggregate (64 M rows) and a parallelism of 8 
(planner.width.max_per_node = 8): minor fragments 0 and 4 used only 1/8 of 
their buckets, while the others used 1/4 of their buckets. The reason for this 
variance may be that distribution uses "hash32AsDouble" while Hash Agg uses 
"hash32".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
