Sorabh Hamirwasia created DRILL-5816:
----------------------------------------

             Summary: Hash function produces skewed results on String values with same leading prefix
                 Key: DRILL-5816
                 URL: https://issues.apache.org/jira/browse/DRILL-5816
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Sorabh Hamirwasia
            Assignee: Sorabh Hamirwasia
             Fix For: 1.12.0


Reported by [~amansinha100]

Hashing of string values (for the hash exchange) can produce substantial skew 
for strings that share the same leading prefix.
Here is the sample data (note that all strings begin with 'mscId=' followed by 
numeric values):

0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
+---------------------+
|          a          |
+---------------------+
| mscId=100139170495  |
| mscId=100103806655  |
| mscId=100229137840  |
| mscId=100362859440  |
| mscId=100032583600  |
| mscId=100125021360  |
| mscId=100243775920  |
| mscId=100152820405  |
| mscId=100084724405  |
| mscId=100297398970  |
| mscId=100059560890  |
| mscId=100106108090  |
| mscId=100032092090  |
| mscId=100029460410  |
| mscId=100110390995  |
| mscId=100019105235  |
| mscId=100354644435  |
| mscId=100288523475  |
| mscId=100214507475  |
| mscId=100296418515  |
+---------------------+
20 rows selected (0.33 seconds)

Here are the hash values produced by the hash function that Drill uses for the 
HashToRandomExchange (note that they are all even numbers):

0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
+--------------+
|    EXPR$0    |
+--------------+
| 1180062632   |
| -1322734784  |
| 2096701320   |
| 2075007536   |
| -1970336592  |
| 1614574192   |
| 1592743936   |
| -1053691072  |
| -689805200   |
| 1893061072   |
| 1660328376   |
| 1852126136   |
| 1927731344   |
| 616840056    |
| -1997249184  |
| 1588717872   |
| 193019624    |
| 880839008    |
| 1879415496   |
| 1726850216   |
+--------------+
20 rows selected (0.311 seconds)

Taking the hash values mod 56 produces only 1 distinct value, which indicates 
the skew:

0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (1.041 seconds)
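
The same check can be reproduced outside Drill on the 20 hash values listed 
above. The following is a minimal sketch in Python, not Drill code: the helper 
names bucket_counts and common_divisor are hypothetical, and it assumes that a 
receiver is picked as hash mod N, with N = 56 taken from the query above. It 
counts how many values fall into each bucket and computes the greatest common 
divisor of the sample.

```python
from collections import Counter
from functools import reduce
from math import gcd

# The 20 hash values copied from the hash32AsDouble output in this issue.
SAMPLE_HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]


def bucket_counts(hashes, num_buckets):
    """Count how many hash values land in each bucket under hash % num_buckets.

    Note: Python's % always returns a non-negative result for a positive
    modulus, which can differ in sign from SQL mod() on negative inputs.
    For exact multiples of num_buckets both give 0.
    """
    return Counter(h % num_buckets for h in hashes)


def common_divisor(hashes):
    """Greatest common divisor of all hash values (sign is ignored)."""
    return reduce(gcd, hashes)


if __name__ == "__main__":
    for n in (2, 8, 56):
        print(n, dict(bucket_counts(SAMPLE_HASHES, n)))
    print("gcd of sample:", common_divisor(SAMPLE_HASHES))
```

For this sample, every value is a multiple of 56, so all 20 rows land in 
bucket 0 for N = 2, 8 and 56 alike. A well-distributed hash would be expected 
to spread the rows across many buckets instead.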



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
