[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678852#comment-16678852
 ] 

Gopal V commented on HIVE-20873:
--------------------------------

[~bslim]: Teddy & I have a UDF for the hash function, which we use to calculate 
skews.

I've merged Teddy's changes into it

https://github.com/t3rmin4t0r/long-hash-udf

{code}
select long2hash(i_item_sk, 1) & 255, count(1)  from item group by 
long2hash(i_item_sk, 1) & 255 order by count(1) desc ;

0       65536
2       65536
3       65536
1       65535
5       37857
{code}

So there's a bit-skew in the old hash function, instead of generating 256 
unique bit-patterns, but it skews the low-bits by the 2nd arg to the long2 hash.

{code}
select long2murmur(i_item_sk, 1) & 255, count(1)  from item group by 
long2murmur(i_item_sk, 1) & 255 order by count(1) desc ;

170     1274
37      1264
220     1254
110     1253
152     1241
5       1235
56      1232
179     1231
231     1228
168     1228
149     1228
84      1222
...
156     1082
Time taken: 1.727 seconds, Fetched: 256 row(s)
{code} 

> Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision
> ------------------------------------------------------------------------
>
>                 Key: HIVE-20873
>                 URL: https://issues.apache.org/jira/browse/HIVE-20873
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Teddy Choi
>            Assignee: Teddy Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-20873.1.patch, HIVE-20873.2.patch
>
>
> VectorHashKeyWrapperTwoLong is implemented with few bit shift operators and 
> XOR operators for short computation time, but more hash collision. Group by 
> operations become very slow on large data sets. It needs Murmur hash or a 
> better hash function for less hash collision.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to