[
https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Scaffidi updated HADOOP-12217:
------------------------------------
Status: Patch Available (was: Open)
This is the simplest fix that calculates a correct hashCode without creating a
Double object. I have not yet tested it in a production-level environment,
though. I can add some tests to show the effectiveness of the hashCode
distribution if desired.
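
For readers without the patch at hand, here is a minimal sketch of what such a
fix could look like. It is not the patch itself: the class and method names are
hypothetical, and it simply assumes the goal is to match what
java.lang.Double.hashCode() computes, working from the raw bit pattern instead
of a boxed Double.

```java
final class DoubleHashSketch {
    // Hypothetical helper, not the actual patch. Folds the upper and lower
    // 32-bit halves of the IEEE 754 bit pattern into one int, which is the
    // same formula java.lang.Double.hashCode() documents, with no allocation.
    static int hash(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        // Nearby whole numbers now get different hash codes.
        System.out.println(hash(1.0) != hash(2.0));
        // The result agrees with the boxed Double's hashCode().
        System.out.println(hash(3.5) == Double.valueOf(3.5).hashCode());
    }
}
```

XOR-folding the two halves means both the exponent and the low mantissa bits
contribute to the result, so values that differ only in their upper bits no
longer collide.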
> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
> Key: HADOOP-12217
> URL: https://issues.apache.org/jira/browse/HADOOP-12217
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 2.7.1, 2.7.0, 2.6.0, 2.5.2, 2.5.1, 2.4.1, 2.5.0, 2.4.0,
> 2.3.0, 2.2.0, 2.0.6-alpha, 2.1.1-beta, 0.23.11, 0.23.10, 0.23.9, 2.0.5-alpha,
> 1.2.1, 0.23.8, 2.0.4-alpha, 2.1.0-beta, 0.23.7, 1.1.2, 0.23.6, 0.23.5,
> 2.0.3-alpha, 0.23.4, 2.0.2-alpha, 2.0.1-alpha, 2.0.0-alpha, 0.23.3, 0.23.1,
> 0.23.0, 0.22.0, 0.21.0, 1.2.0, 1.1.1, 1.1.0, 1.0.4, 1.0.3, 1.0.2, 1.0.1,
> 1.0.0, 0.20.205.0, 0.20.204.0, 0.20.203.0, 0.20.2, 0.20.1, 0.20.0, 0.19.1,
> 0.19.0, 0.18.3, 0.18.2, 0.18.1, 0.18.0
> Reporter: Steve Scaffidi
> Labels: easyfix
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this while testing the latest version of Hive, where certain
> mapjoin queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable
> that used to override hashCode() with a correct implementation, but
> for some reason they recently removed that code, so it now uses the incorrect
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was
> created (wow!), so I can understand if fixing it is impractical due to the
> possibility of breaking things down-stream, but I can't think of anything
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629,
> HIVE-7041
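
To illustrate the collision problem described above, the following sketch
counts distinct hash codes for the whole numbers 0 through 999. The issue text
does not quote the existing hashCode(), so truncatingHash below is an
assumption: it supposes the problematic method keeps only the low 32 bits of
the double's bit pattern. All class and method names here are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class DoubleHashCollisionDemo {
    // ASSUMPTION: models the problematic hash as a plain cast of the 64-bit
    // pattern to int, which discards the sign, exponent and high mantissa bits.
    static int truncatingHash(double v) {
        return (int) Double.doubleToLongBits(v);
    }

    // Same formula as java.lang.Double.hashCode(): XOR of both 32-bit halves.
    static int foldingHash(double v) {
        long bits = Double.doubleToLongBits(v);
        return (int) (bits ^ (bits >>> 32));
    }

    // Number of distinct hash codes over the doubles 0.0, 1.0, ..., n - 1.
    static int distinctHashes(int n, boolean fold) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(fold ? foldingHash(i) : truncatingHash(i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("truncating: " + distinctHashes(1000, false));
        System.out.println("folding:    " + distinctHashes(1000, true));
    }
}
```

Small whole numbers have all-zero low mantissa bits, so under the assumed
truncating hash every one of these keys lands in the same HashMap bucket, while
the folding hash keeps them apart.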