[ 
https://issues.apache.org/jira/browse/DRILL-6745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622501#comment-16622501
 ] 

Parth Chandra commented on DRILL-6745:
--------------------------------------

We used to have an xxHash implementation: 
[XXHash.java|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/XXHash.java].
It was a translation of the xxHash C implementation, which generates a 64-bit 
hash, into a Java implementation that generates a 32-bit hash.

The result was a hash function that, for some data sets, did not produce a 
uniform distribution, which led to skewed data distribution and worse 
performance. The problem was that the 64-bit hash does not map uniformly to 
32 bits. Fixing this is non-trivial.
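
To make the skew concern concrete, here is a minimal Java sketch of how one might measure bucket skew after narrowing a 64-bit hash to 32 bits. It is not the Drill XXHash code. The class name HashSkewSketch, the two narrowing helpers, and the synthetic input (values whose varying bits all sit in the upper 32 bits) are assumptions chosen purely for illustration.

```java
import java.util.function.LongToIntFunction;

// Illustrative only: a hypothetical helper, not part of Drill.
final class HashSkewSketch {

    // Narrowing option 1: keep only the low 32 bits of the 64-bit value.
    static int truncate(long h64) {
        return (int) h64;
    }

    // Narrowing option 2: XOR the high half into the low half so that
    // all 64 bits contribute to the 32-bit result.
    static int xorFold(long h64) {
        return (int) (h64 ^ (h64 >>> 32));
    }

    // Ratio of the fullest bucket to the mean bucket size.
    // 1.0 means the values are spread perfectly evenly.
    static double skew(long[] hashes, LongToIntFunction narrow, int buckets) {
        int[] counts = new int[buckets];
        for (long h : hashes) {
            counts[Math.floorMod(narrow.applyAsInt(h), buckets)]++;
        }
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return max / ((double) hashes.length / buckets);
    }

    public static void main(String[] args) {
        // Synthetic worst case (an assumption, not real hash output):
        // every value differs only in its upper 32 bits.
        long[] hashes = new long[1 << 16];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = ((long) i) << 32;
        }
        System.out.println("truncate: " + skew(hashes, HashSkewSketch::truncate, 64));
        System.out.println("xorFold:  " + skew(hashes, HashSkewSketch::xorFold, 64));
    }
}
```

On this synthetic input, truncation sends every value to a single bucket while the XOR fold spreads them evenly. Real hash output will behave differently, so a check like this would need to be run against the actual hash function and representative data.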

I took a quick look at Spark, and it appears to use a 64-bit hash, which 
probably avoids the problem.

Moving to a 64-bit hash has the additional challenge that all the operators 
will have to be updated to use a 64-bit hash.

Also note that computing and using a 32-bit Murmur hash may be faster than 
computing and using a 64-bit xxHash. The 32-bit hash definitely uses less 
memory. There are no benchmarks comparing the two.

 

> Introduce the xxHash algorithm as another hash64 option
> -------------------------------------------------------
>
>                 Key: DRILL-6745
>                 URL: https://issues.apache.org/jira/browse/DRILL-6745
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>            Priority: Major
>             Fix For: 1.15.0
>
>
> Supply another hash64 algorithm, xxHash, as a replacement for MurmurHash. 
> According to the [xxHash|http://cyan4973.github.io/xxHash/] report, it is 
> faster than MurmurHash, and projects like Spark and Presto have adopted it.



