[
https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623484#comment-14623484
]
Steve Scaffidi commented on HADOOP-12217:
-----------------------------------------
Looks like the MapJoinKeyBytes class was removed as part of HIVE-9331 (git
commit c8ba0f96, 2015-01-15). In my testing, Cloudera's distro of Hive 1.1 was
using MapJoinKeyObject, which makes sense, but I've looked through both their
patched Hive code and upstream master/trunk, and I don't see any significant
change from upstream related to this.
I'm still trying to suss out another part of the issue that led me to the bug
I reported here. My affected Hive queries join a STRING column (from the large
table) with an INT column (from the small table used to build the mapjoin
hashtable). Hive converts both the STRING and the INT to DOUBLE for the
purposes of the join, which, AFAICT, is a change in behavior since Hive 0.13.
Because the values I'm joining on are all fairly small integers (about 160,000
values, ranging from 1 to 999,999), the bad hashCode implementation for
DoubleWritable causes the HashMap Hive builds in the local task to degenerate
into a linked list that is exceedingly slow both to build and to load in the
subsequent map tasks. :(
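To illustrate the degeneration: if I'm reading the pre-fix source correctly,
DoubleWritable.hashCode() just truncated the raw IEEE-754 bits to an int,
i.e. (int) Double.doubleToLongBits(value). For any integer below 2^20 the low
32 bits of the double representation are all zero, so every such key hashes to
0. The sketch below (standalone, not using the Hadoop classes themselves)
contrasts that truncating hash with the XOR-folding hash that
java.lang.Double.hashCode() uses:

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleHashDemo {
    // Truncating hash, as I believe DoubleWritable implemented it pre-fix:
    // keeps only the low 32 bits of the IEEE-754 representation.
    static int truncatingHash(double d) {
        return (int) Double.doubleToLongBits(d);
    }

    // XOR-folding hash, equivalent to java.lang.Double.hashCode():
    // mixes the high 32 bits into the low 32 bits.
    static int foldedHash(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        Set<Integer> truncated = new HashSet<>();
        Set<Integer> folded = new HashSet<>();
        // Same value range as the join keys in my queries: 1 to 999,999.
        for (int i = 1; i <= 999_999; i++) {
            truncated.add(truncatingHash((double) i));
            folded.add(foldedHash((double) i));
        }
        // All 999,999 keys collide under the truncating hash (999,999 < 2^20,
        // so the low 32 mantissa bits are always zero); the folded hash keeps
        // them all distinct.
        System.out.println("distinct truncating hashes: " + truncated.size()); // 1
        System.out.println("distinct folded hashes: " + folded.size());        // 999999
    }
}
```

With a single hash bucket, every HashMap insert and lookup scans the whole
chain, which matches the slow-build, slow-load behavior I'm seeing.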
On the other hand, converting to DOUBLE for the comparison makes sense given
the table of implicit conversions in the documentation - it seems to me that
the old behavior must have been incorrect and has since been "fixed" :)
Unfortunately I have too many users with too many queries that depend on the
performance of the old behavior - it's easier for me to patch Hadoop or Hive!
Once I figure out where and why Hive's behavior changed, I'll file a ticket
there too, if necessary, hopefully with useful patches :)
> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
> Key: HADOOP-12217
> URL: https://issues.apache.org/jira/browse/HADOOP-12217
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0,
> 0.20.1, 0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2,
> 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3,
> 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4, 2.0.3-alpha, 0.23.5, 0.23.6,
> 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha, 0.23.9,
> 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1,
> 2.5.1, 2.5.2, 2.6.0, 2.7.0, 2.7.1
> Reporter: Steve Scaffidi
> Labels: easyfix
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the
> keys in a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin
> queries were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable
> that used to override hashCode() with a correct implementation, but for some
> reason that code was recently removed, so it now uses the incorrect
> hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was
> created (wow!), so I can understand if fixing it is impractical due to the
> possibility of breaking things downstream, but I can't think of anything
> that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some
> historical perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629,
> HIVE-7041
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)