Steve Scaffidi created HADOOP-12217:
---------------------------------------
Summary: hashCode in DoubleWritable returns same value for many
numbers
Key: HADOOP-12217
URL: https://issues.apache.org/jira/browse/HADOOP-12217
Project: Hadoop Common
Issue Type: Bug
Components: io
Affects Versions: 2.7.1, 2.7.0, 2.6.0, 2.5.2, 2.5.1, 2.4.1, 2.5.0, 2.4.0,
2.3.0, 2.2.0, 2.0.6-alpha, 2.1.1-beta, 0.23.11, 0.23.10, 0.23.9, 2.0.5-alpha,
1.2.1, 0.23.8, 2.0.4-alpha, 2.1.0-beta, 0.23.7, 1.1.2, 0.23.6, 0.23.5,
2.0.3-alpha, 0.23.4, 2.0.2-alpha, 2.0.1-alpha, 2.0.0-alpha, 0.23.3, 0.23.1,
0.23.0, 0.22.0, 0.21.0, 1.2.0, 1.1.1, 1.1.0, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 1.0.0,
0.20.205.0, 0.20.204.0, 0.20.203.0, 0.20.2, 0.20.1, 0.20.0, 0.19.1, 0.19.0,
0.18.3, 0.18.2, 0.18.1, 0.18.0
Reporter: Steve Scaffidi
Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the
keys in a HashMap results in abysmal performance, due to hash code collisions.
I discovered this when testing the latest version of Hive and certain mapjoin
queries were exceedingly slow.
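For illustration, here is a minimal sketch of the kind of truncating hash that produces these collisions. The buggy behavior is assumed here to keep only the low 32 bits of Double.doubleToLongBits(), which are all zero for small whole-valued doubles, so e.g. 1.0, 2.0, and 3.0 all land in the same HashMap bucket; a mixing hash like java.lang.Double.hashCode() XORs the high and low halves and avoids this:

```java
public class DoubleHashDemo {
    // Truncating hash (assumed to mirror the flawed behavior described
    // above): keeps only the low 32 bits of the IEEE-754 bit pattern.
    static int truncatingHash(double d) {
        return (int) Double.doubleToLongBits(d);
    }

    // Mixing hash, as in java.lang.Double.hashCode(): XORs the high and
    // low 32-bit halves so both contribute to the result.
    static int mixingHash(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        // Small whole numbers encode entirely in the high 32 bits, so the
        // truncating hash returns 0 for every one of them.
        for (double d = 1.0; d <= 4.0; d += 1.0) {
            System.out.println(d + " -> truncating=" + truncatingHash(d)
                    + " mixing=" + mixingHash(d));
        }
    }
}
```

With the truncating hash, every whole-valued key collides and HashMap lookups degrade toward linear scans, which matches the mapjoin slowdown described above.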
Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable
that used to override hashCode() with a correct implementation, but for some
reason that override was recently removed, so Hive now uses the incorrect
hashCode() method inherited from Hadoop's DoubleWritable.
It appears that this bug has existed since DoubleWritable was created(!), so
I can understand if fixing it is impractical due to the possibility of breaking
things downstream, but off the top of my head I can't think of anything that
*should* break.
Searching JIRA, I found several related tickets, which may be useful for some
historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, HIVE-7041
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)