[ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15022389#comment-15022389 ]

ASF GitHub Bot commented on DRILL-4119:
---------------------------------------

GitHub user amansinha100 opened a pull request:

    https://github.com/apache/drill/pull/279

    DRILL-4119: Modify hash32 functions to combine the msb and lsb bytes of a 64-bit hash value (previously, we were casting to integer).
    
     - Use this new set of functions (for all data types) for creating the hash values needed for hash distribution, hash joins, etc.
     - Rename HashFunctions to Hash32Functions to be consistent with the Hash64 counterpart.
     - Many data types did not have a hash32AsDouble equivalent; added these.
     - Add hash32 functions with seed.
     - Fix unit tests, add "hash" as a synonym for "hash32".
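
    The difference between casting a 64-bit hash to an integer and combining its
    upper and lower halves can be sketched as follows. This is only an
    illustration: the class and method names are hypothetical, and XOR is an
    assumed way of combining the two halves; the description above does not say
    how the actual Hash32Functions combine them.

```java
// Illustrative sketch only; not the Drill implementation.
class Hash32FoldSketch {

    // Casting to int keeps only the low 32 bits; the high 32 bits are discarded.
    static int truncate(long hash64) {
        return (int) hash64;
    }

    // Assumed combination: XOR the high 32 bits into the low 32 bits,
    // so both halves of the 64-bit value influence the result.
    static int fold(long hash64) {
        return (int) (hash64 ^ (hash64 >>> 32));
    }

    public static void main(String[] args) {
        // Two 64-bit values that differ only in their high 32 bits.
        long a = 0x1111111100000001L;
        long b = 0x2222222200000001L;

        System.out.println(truncate(a) == truncate(b)); // true: the cast collides
        System.out.println(fold(a) == fold(b));         // false: the fold separates them
    }
}
```

    With a plain cast, any two 64-bit hashes that share their low 32 bits map to
    the same 32-bit value, whatever their high bits are; folding lets the high
    bits contribute as well.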

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amansinha100/incubator-drill hashfunc2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #279
    
----
commit 113d8adf70ffd359c747a5076da30e0c6a82f39b
Author: Aman Sinha <[email protected]>
Date:   2015-11-21T18:23:52Z

    DRILL-4119: Modify hash32 functions to combine the msb and lsb bytes of a 64-bit hash value (previously, we were casting to integer).
     - Use this new set of functions (for all data types) for creating the hash values needed for hash distribution, hash joins, etc.
     - Rename HashFunctions to Hash32Functions to be consistent with the Hash64 counterpart.
     - Many data types did not have a hash32AsDouble equivalent; added these.
     - Add hash32 functions with seed.
     - Fix unit tests, add "hash" as a synonym for "hash32".

----


> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32. It is easily reproducible with a group-by query: 
> {noformat}
> EXPLAIN PLAN FOR SELECT SomeId FROM table GROUP BY SomeId;
> ...
> 01-02          HashAgg(group=[{0}])
> 01-03            Project(SomeId=[$0])
> 01-04              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                    HashAgg(group=[{0}])
> 03-03                      Project(SomeId=[$0])
> {noformat}
> The string ids in this column have the following form: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
