[ https://issues.apache.org/jira/browse/DRILL-6745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622501#comment-16622501 ]
Parth Chandra commented on DRILL-6745:
--------------------------------------
We used to have an xxHash implementation:
[XXHash.java|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/XXHash.java].
The implementation was a translation of the xxHash C implementation, which
generates a 64-bit hash, into a Java implementation that generates a 32-bit
hash.
For some data sets, the resulting hash did not produce a uniform distribution,
which led to skewed data distribution and worse performance.
The problem was that the 64-bit hash does not map uniformly to 32 bits. Fixing
this is nontrivial.
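To make the 64-to-32-bit reduction concrete, here is a minimal sketch. It is
not Drill's XXHash.java and not xxHash itself: it uses the SplitMix64 finalizer
as a stand-in 64-bit hash (an assumption for illustration), and the class and
method names are hypothetical. It compares two common reductions, truncating to
the low 32 bits and XOR-folding the two halves, by counting keys per bucket.

```java
final class HashFoldSketch {
    // Stand-in 64-bit mixer (SplitMix64 finalizer). NOT xxHash, NOT Drill code.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Reduction 1: keep only the low 32 bits of the 64-bit hash.
    static int truncate(long h) {
        return (int) h;
    }

    // Reduction 2: XOR the high half into the low half so all 64 bits contribute.
    static int xorFold(long h) {
        return (int) (h ^ (h >>> 32));
    }

    // Hash the keys 0..keys-1 and count how many land in each bucket.
    static int[] histogram(int buckets, int keys, boolean fold) {
        int[] counts = new int[buckets];
        for (long k = 0; k < keys; k++) {
            long h = mix64(k);
            int h32 = fold ? xorFold(h) : truncate(h);
            counts[Math.floorMod(h32, buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        for (boolean fold : new boolean[] {false, true}) {
            int[] counts = histogram(16, 1_000_000, fold);
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int c : counts) {
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
            // A large gap between min and max indicates skew for this key set.
            System.out.println((fold ? "xorFold " : "truncate ") + "min=" + min + " max=" + max);
        }
    }
}
```

Running the same histogram against real key sets and the actual hash function
is what would show whether a given reduction skews; this sketch only shows the
shape of such a check.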
I took a quick look at Spark and it appears to use a 64-bit hash, which
probably avoids the problem.
Moving to a 64-bit hash has the additional challenge that all the operators
will have to be updated to use it.
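As a rough illustration of the kind of change each hash consumer would need,
here is a sketch of picking a partition from a 32-bit versus a 64-bit hash. The
class and method names are hypothetical and do not correspond to Drill's
operator code.

```java
final class PartitionSketch {
    // 32-bit style: floorMod keeps the result non-negative for negative hashes.
    static int partitionFor32(int hash32, int numPartitions) {
        return Math.floorMod(hash32, numPartitions);
    }

    // 64-bit style: treat the hash as unsigned, then reduce to a partition index.
    // The result is below numPartitions, so the cast to int is safe.
    static int partitionFor64(long hash64, int numPartitions) {
        return (int) Long.remainderUnsigned(hash64, numPartitions);
    }
}
```

Every place that stores, compares, or passes the hash as an `int` would need a
similar widening, which is where the memory cost of the wider hash comes from.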
Also note that computing and using a 32-bit Murmur hash may be faster than
computing and using a 64-bit xxHash. The 32-bit hash definitely uses less
memory. There are no benchmarks comparing the two.
> Introduce the xxHash algorithm as another hash64 option
> -------------------------------------------------------
>
> Key: DRILL-6745
> URL: https://issues.apache.org/jira/browse/DRILL-6745
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: weijie.tong
> Assignee: weijie.tong
> Priority: Major
> Fix For: 1.15.0
>
>
> Supply another hash64 algorithm, xxHash, as a replacement for MurmurHash.
> According to the [xxHash|http://cyan4973.github.io/xxHash/] report, it is
> faster than MurmurHash, and projects such as Spark and Presto have adopted it.