[ 
https://issues.apache.org/jira/browse/DRILL-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunhui Shi resolved DRILL-4237.
--------------------------------
       Resolution: Fixed
         Reviewer: Aman Sinha
    Fix Version/s: 1.7.0

> Skew in hash distribution
> -------------------------
>
>                 Key: DRILL-4237
>                 URL: https://issues.apache.org/jira/browse/DRILL-4237
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.4.0
>            Reporter: Aman Sinha
>            Assignee: Chunhui Shi
>             Fix For: 1.7.0
>
>
> Apparently, the fix in DRILL-4119 did not fully resolve the data skew issue.  
> It worked fine on the smaller sample of the data set but on another sample of 
> the same data set, it still produces skewed values - see below the hash 
> values which are all odd numbers. 
> {noformat}
> 0: jdbc:drill:zk=local> select columns[0], hash32(columns[0]) from `test.csv` 
> limit 10;
> +-----------------------------------+--------------+
> |              EXPR$0               |    EXPR$1    |
> +-----------------------------------+--------------+
> | f71aaddec3316ae18d43cb1467e88a41  | 1506011089   |
> | 3f3a13bb45618542b5ac9d9536704d3a  | 1105719049   |
> | 6935afd0c693c67bba482cedb7a2919b  | -18137557    |
> | ca2a938d6d7e57bda40501578f98c2a8  | -1372666789  |
> | fab7f08402c8836563b0a5c94dbf0aec  | -1930778239  |
> | 9eb4620dcb68a84d17209da279236431  | -970026001   |
> | 16eed4a4e801b98550b4ff504242961e  | 356133757    |
> | a46f7935fea578ce61d8dd45bfbc2b3d  | -94010449    |
> | 7fdf5344536080c15deb2b5a2975a2b7  | -141361507   |
> | b82560a06e2e51b461c9fe134a8211bd  | -375376717   |
> +-----------------------------------+--------------+
> {noformat}
> This indicates an underlying issue with the XXHash64 java implementation, 
> which is Drill's implementation of the C version.  One of the key difference 
> as pointed out by [~jnadeau] was the use of unsigned int64 in the C version 
> compared to the Java version which uses (signed) long.  I created an XXHash 
> version using com.google.common.primitives.UnsignedLong.  However, 
> UnsignedLong does not have bit-wise operations that are needed for XXHash 
> such as rotateLeft(),  XOR etc.  One could write wrappers for these but at 
> this point, the question is: should we think of an alternative hash function 
> ? 
> The alternative approach could be the murmur hash for numeric data types that 
> we were using earlier and the Mahout version of hash function for string 
> types 
> (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/HashHelper.java#L28).
>   As a test, I reverted to this function and was getting good hash 
> distribution for the test data. 
> I could not find any performance comparisons of our perf tests (TPC-H or DS) 
> with the original and newer (XXHash) hash functions.  If performance is 
> comparable, should we revert to the original function ?  
> As an aside, I would like to remove the hash64 versions of the functions 
> since these are not used anywhere. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to