Sorabh Hamirwasia created DRILL-5816:
----------------------------------------

             Summary: Hash function produces skewed results on String values with same leading prefix
                 Key: DRILL-5816
                 URL: https://issues.apache.org/jira/browse/DRILL-5816
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Sorabh Hamirwasia
            Assignee: Sorabh Hamirwasia
             Fix For: 1.12.0

Reported by [~amansinha100]

Hashing of string values (for the hash exchange) could produce substantial skew for certain types of strings that have the same leading prefix.

Here's the sample data (note that all strings begin with 'mscId=' followed by numeric values):

0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
+---------------------+
|          a          |
+---------------------+
| mscId=100139170495  |
| mscId=100103806655  |
| mscId=100229137840  |
| mscId=100362859440  |
| mscId=100032583600  |
| mscId=100125021360  |
| mscId=100243775920  |
| mscId=100152820405  |
| mscId=100084724405  |
| mscId=100297398970  |
| mscId=100059560890  |
| mscId=100106108090  |
| mscId=100032092090  |
| mscId=100029460410  |
| mscId=100110390995  |
| mscId=100019105235  |
| mscId=100354644435  |
| mscId=100288523475  |
| mscId=100214507475  |
| mscId=100296418515  |
+---------------------+
20 rows selected (0.33 seconds)

Here are the hash values using the hash function that Drill uses for the HashToRandomExchange (note that they are all even numbers):

0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
+--------------+
|    EXPR$0    |
+--------------+
| 1180062632   |
| -1322734784  |
| 2096701320   |
| 2075007536   |
| -1970336592  |
| 1614574192   |
| 1592743936   |
| -1053691072  |
| -689805200   |
| 1893061072   |
| 1660328376   |
| 1852126136   |
| 1927731344   |
| 616840056    |
| -1997249184  |
| 1588717872   |
| 193019624    |
| 880839008    |
| 1879415496   |
| 1726850216   |
+--------------+
20 rows selected (0.311 seconds)

Doing a mod 56 produces only 1 distinct value, which indicates the skew:

0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (1.041 seconds)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
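
The mod 56 observation in the report can be re-checked offline against the 20 hash values listed there. The following is only a minimal sketch: the class name `HashSkewCheck` and the method `distinctBuckets` are hypothetical, and the plain modulo bucket assignment is an assumption that mirrors the `mod(..., 56)` query rather than Drill's actual partitioning code.

```java
import java.util.Set;
import java.util.TreeSet;

public class HashSkewCheck {
    // Hash values copied verbatim from the hash32AsDouble output in the report.
    static final int[] REPORTED_HASHES = {
        1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
        1614574192, 1592743936, -1053691072, -689805200, 1893061072,
        1660328376, 1852126136, 1927731344, 616840056, -1997249184,
        1588717872, 193019624, 880839008, 1879415496, 1726850216
    };

    // Returns the distinct bucket indexes when each hash is assigned to
    // (hash mod numBuckets). Assumption: a simple modulo stands in for
    // the receiver selection done by the hash exchange.
    static Set<Integer> distinctBuckets(int[] hashes, int numBuckets) {
        Set<Integer> buckets = new TreeSet<>();
        for (int h : hashes) {
            // floorMod keeps the result non-negative for negative hashes.
            buckets.add(Math.floorMod(h, numBuckets));
        }
        return buckets;
    }

    public static void main(String[] args) {
        // For the reported values, every hash lands in bucket 0 of 56.
        System.out.println(distinctBuckets(REPORTED_HASHES, 56)); // prints [0]
    }
}
```

Under this assumed assignment, a 56-way exchange would route all 20 sample rows to a single receiver, which is the skew described in the report.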