[ 
https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Scaffidi updated HADOOP-12217:
------------------------------------
    Status: Patch Available  (was: Open)

This is the simplest fix that computes a correct hashCode without creating a 
Double object. I have not yet tested it in a production-level environment, 
though. I can add tests showing the quality of the hashCode distribution if 
desired.
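
For illustration, here is a minimal sketch of one way to hash a double without 
allocating a Double object. It is not the attached patch; the class and method 
names (DoubleHashSketch, foldedHash) are hypothetical. It folds the 64-bit 
representation into 32 bits using the formula documented for 
java.lang.Double.hashCode().

```java
// Sketch only: a stand-in for the hashCode() logic, not the attached patch.
// DoubleHashSketch and foldedHash are hypothetical names.
final class DoubleHashSketch {

    // Fold the 64-bit IEEE 754 bit pattern into 32 bits by XORing the
    // high and low halves. No Double object is created.
    static int foldedHash(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        // Print the folded hash for a few sample values.
        for (double d : new double[] {0.5, 1.0, 2.0, 3.0}) {
            System.out.println(d + " -> " + foldedHash(d));
        }
    }
}
```

Because the high half of the bit pattern (sign, exponent, and leading mantissa 
bits) takes part in the result, values that differ only in those bits no 
longer share a hash code.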

> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
>                 Key: HADOOP-12217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.7.1, 2.7.0, 2.6.0, 2.5.2, 2.5.1, 2.4.1, 2.5.0, 2.4.0, 
> 2.3.0, 2.2.0, 2.0.6-alpha, 2.1.1-beta, 0.23.11, 0.23.10, 0.23.9, 2.0.5-alpha, 
> 1.2.1, 0.23.8, 2.0.4-alpha, 2.1.0-beta, 0.23.7, 1.1.2, 0.23.6, 0.23.5, 
> 2.0.3-alpha, 0.23.4, 2.0.2-alpha, 2.0.1-alpha, 2.0.0-alpha, 0.23.3, 0.23.1, 
> 0.23.0, 0.22.0, 0.21.0, 1.2.0, 1.1.1, 1.1.0, 1.0.4, 1.0.3, 1.0.2, 1.0.1, 
> 1.0.0, 0.20.205.0, 0.20.204.0, 0.20.203.0, 0.20.2, 0.20.1, 0.20.0, 0.19.1, 
> 0.19.0, 0.18.3, 0.18.2, 0.18.1, 0.18.0
>            Reporter: Steve Scaffidi
>              Labels: easyfix
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the 
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this while testing the latest version of Hive, where certain 
> mapjoin queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable 
> that used to override hashCode() with a correct implementation, but for some 
> reason they recently removed that code, so it now uses the incorrect 
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was 
> created (wow!), so I can understand if fixing it is impractical due to the 
> possibility of breaking things downstream, but I can't think of anything 
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some 
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, 
> HIVE-7041
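
The collisions described above can be reproduced with a small sketch. It 
assumes the problematic hash is a plain narrowing cast of the 64-bit 
representation to int. That is an assumption made for illustration, not a 
quotation of the Hadoop source, and TruncatedHashDemo and its methods are 
hypothetical names. Under that assumption, every double whose low 32 mantissa 
bits are zero, which includes all small whole numbers, gets the same hash.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only. TruncatedHashDemo is a hypothetical name.
final class TruncatedHashDemo {

    // ASSUMPTION: models the problematic hash as a narrowing cast, which
    // keeps only the low 32 bits of the IEEE 754 bit pattern.
    static int truncatedHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    // Count the distinct hash values produced for the whole numbers 0 .. n-1.
    static int distinctHashes(int n) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(truncatedHash((double) i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct hashes for 0..999: " + distinctHashes(1000));
    }
}
```

When every key lands in the same bucket, HashMap lookups degrade from roughly 
constant time toward a scan of that bucket, which is consistent with the slow 
mapjoin queries reported above.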



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
