[
https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623484#comment-14623484
]
Steve Scaffidi commented on HADOOP-12217:
-----------------------------------------
Looks like the MapJoinKeyBytes class was removed as part of HIVE-9331 (git
commit c8ba0f96, 2015-01-15). In my testing, Cloudera's distro of Hive 1.1 was
using MapJoinKeyObject, which makes sense, but I've looked through both their
patched Hive code and upstream master/trunk, and I don't see any significant
change from upstream related to this.
I'm still trying to suss out another part of the issue that led me to the bug
I reported here. My affected Hive queries join a STRING column (from the large
table) with an INT column (from the small table used to build the mapjoin
hashtable). Hive converts both the STRING and the INT to DOUBLE for the
purposes of the join, which, AFAICT, is a change in behavior since Hive 0.13.
Because the values I'm joining on are all fairly small integers (about 160,000
values, ranging from 1 to 999,999), the bad hashCode implementation for
DoubleWritable causes the HashMap Hive builds in the local task to degenerate
into a linked list that is exceedingly slow both to build and to load in the
subsequent map tasks. :(
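To illustrate the degeneration: if I'm reading the pre-fix source correctly,
DoubleWritable.hashCode() just truncated the raw IEEE-754 bits to an int,
i.e. (int) Double.doubleToLongBits(value). For any integer below 2^20 the low
32 bits of the double representation are all zero, so every such key hashes to
0. The sketch below (standalone, not using the Hadoop classes themselves)
contrasts that truncating hash with the XOR-folding hash that
java.lang.Double.hashCode() uses:

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleHashDemo {
    // Truncating hash, as I believe DoubleWritable implemented it pre-fix:
    // keeps only the low 32 bits of the IEEE-754 representation.
    static int truncatingHash(double d) {
        return (int) Double.doubleToLongBits(d);
    }

    // XOR-folding hash, equivalent to java.lang.Double.hashCode():
    // mixes the high 32 bits into the low 32 bits.
    static int foldedHash(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        Set<Integer> truncated = new HashSet<>();
        Set<Integer> folded = new HashSet<>();
        // Same value range as the join keys in my queries: 1 to 999,999.
        for (int i = 1; i <= 999_999; i++) {
            truncated.add(truncatingHash((double) i));
            folded.add(foldedHash((double) i));
        }
        // All 999,999 keys collide under the truncating hash (999,999 < 2^20,
        // so the low 32 mantissa bits are always zero); the folded hash keeps
        // them all distinct.
        System.out.println("distinct truncating hashes: " + truncated.size()); // 1
        System.out.println("distinct folded hashes: " + folded.size());        // 999999
    }
}
```

With a single hash bucket, every HashMap insert and lookup scans the whole
chain, which matches the slow-build, slow-load behavior I'm seeing.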
On the other hand, converting to DOUBLE for the comparison makes sense given
the table of implicit conversions in the documentation - it seems to me that
the old behavior must have been incorrect and has since been "fixed" :)
Unfortunately I have too many users with too many queries that depend on the
performance of the old behavior - it's easier for me to patch Hadoop or Hive!
Once I figure out where and why Hive's behavior changed, I'll file a ticket
there too, if necessary, hopefully with useful patches :)
> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
> Key: HADOOP-12217
> URL: https://issues.apache.org/jira/browse/HADOOP-12217
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0,
> 0.20.1, 0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2,
> 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3,
> 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4, 2.0.3-alpha, 0.23.5, 0.23.6,
> 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha, 0.23.9,
> 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1,
> 2.5.1, 2.5.2, 2.6.0, 2.7.0, 2.7.1
> Reporter: Steve Scaffidi
> Labels: easyfix
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin
> queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable
> that used to override hashCode() with a correct implementation, but for some
> reason that code was recently removed, so it now uses the incorrect
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was
> created (wow!), so I can understand if fixing it is impractical due to the
> possibility of breaking things downstream, but I can't think of anything
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629,
> HIVE-7041
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)