[
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024802#comment-15024802
]
Aman Sinha commented on DRILL-4119:
-----------------------------------
I did some more testing with the sample data. Here are 3 hash values:
- hash64 is the native hash64 computed by XXHash.hash64()
- hash64_downcast is the same value downcast to int
- newhash is the the new 32 bit hash value computed by the proposed fix
(combining the first and last 4 bytes of hash64).
{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64,
castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash
from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+--------------+
| id | hash64 | hash64_downcast |
newhash |
+-----------------------------------+----------------------+------------------+--------------+
| 1a883d005e0ce003b918d737ac697e7c | 6695077304582944118 | 594687350 |
2140898336 |
| e4b4388e8865819126cb0e4dcaa7261d | 2614721709087477964 | -2136387380 |
-1528820922 |
| 639a06fb09c70cc397666d38a8134af5 | 3943910117127083836 | 359520060 |
601244263 |
| ae03f853f40c307aa24894e414a6dfdc | 4320987148691340574 | 214334750 |
925976565 |
| 2dd3fdace36431e3810437bee1c7e3f1 | 5657579594883017754 | -1719653350 |
-687608144 |
| 00abdb137380e6ea8cb3e67df40c30dd | 5039129256017100358 | 573406790 |
1740892954 |
| d65d4e30ec96a588e82847aca619e4a0 | 550451582126160076 | 716077260 |
755884032 |
| 956f968866b3151ad472edfcafb579fa | 39366413145792912 | 1336074640 |
1328101915 |
| 75577f830d12c86fd1de94d45cfa0715 | 6480730101791620276 | -226984780 |
-1417129724 |
| 298aa703dbee9e5f303372fe7a764975 | 7844015280248941602 | -2013696990 |
-350034316 |
+-----------------------------------+----------------------+------------------+--------------+
10 rows selected (0.228 seconds)
{noformat}
A key observation is that all hash64 values are even numbers. This is not a
good thing. I confirmed the behavior over a larger sample of 100 rows.
However, this seems specific to strings that are 32 chars (or maybe longer,
although a simple test for a 64 char string did not show the same pattern).
I then modified the seed value to 1 (default is 0). This time I got better
distribution for the hash64:
{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as
hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0],
1) as newhash from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+-------------+
| id | hash64 | hash64_downcast |
newhash |
+-----------------------------------+----------------------+------------------+-------------+
| 1a883d005e0ce003b918d737ac697e7c | 3877569168361489241 | 1211204441 |
2113824708 |
| e4b4388e8865819126cb0e4dcaa7261d | 5555510472474498931 | 567154547 |
1826042916 |
| 639a06fb09c70cc397666d38a8134af5 | 6160367672898924663 | 1827713143 |
965653941 |
| ae03f853f40c307aa24894e414a6dfdc | 5573714012720216212 | 533608596 |
1385691081 |
| 2dd3fdace36431e3810437bee1c7e3f1 | 4742615352245986962 | 284141202 |
1363050779 |
| 00abdb137380e6ea8cb3e67df40c30dd | 5870154798330275502 | 185067182 |
1517362206 |
| d65d4e30ec96a588e82847aca619e4a0 | 5469776233948339425 | 828202209 |
2058735712 |
| 956f968866b3151ad472edfcafb579fa | 8671446365158603789 | -1675527155 |
-462006645 |
| 75577f830d12c86fd1de94d45cfa0715 | 3369914886384026207 | 238584415 |
553440739 |
| 298aa703dbee9e5f303372fe7a764975 | 3765901389360033496 | 1811181272 |
1605846404 |
+-----------------------------------+----------------------+------------------+-------------+
10 rows selected (0.263 seconds)
{noformat}
I am thinking we should put in the proposed fix I sent earlier since it
improves things. Separately, I think we need to investigate the quality of the
XXHash.hash64 implementation. BTW, I also downloaded the original XXHash's C
implementation and based on an initial analysis that one produces different
hash value than our implementation and does not seem to have the same 'even
number' pattern.
> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.3.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of
> length 32. It is easily reproducible by a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02 HashAgg(group=[{0}])
> 01-03 Project(SomeId=[$0])
> 01-04 HashToRandomExchange(dist0=[[$0]])
> 02-01 UnorderedMuxExchange
> 03-01 Project(SomeId=[$0],
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02 HashAgg(group=[{0}])
> 03-03 Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type:
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)