[
https://issues.apache.org/jira/browse/CRUNCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561823#comment-14561823
]
Brandon Vargo commented on CRUNCH-528:
--------------------------------------
Ahh, makes sense. I thought you were only going to replace the == with equals
in the first line of cmp, my apologies.
I could see a case where a join would still have interleaving keys when there
is a hash collision, due to the arbitrary ordering, so the cross-product would
not be complete. I went looking for a way to serialize the keys using the
underlying type (e.g. Writable) in order to break the tie more
deterministically, but I didn't see an easy, sensible way of doing so. At
least now (k1, v1) and (k2, v2) won't get joined together when k1 and k2 are
not equal but hash(k1) == hash(k2).
Short of there being another tie-breaker that I am missing, the patch looks
good to me. Thanks!
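
To illustrate the hash-collision point in the comment above, here is a minimal Java sketch. It is not the code from the attached patch: the class `HashTieSketch`, the key type `Key`, and both comparison methods are hypothetical names made up for this example. It compares a hash-only comparison with one that checks `equals` first, using two distinct keys whose hash codes collide.

```java
public class HashTieSketch {
    // Hypothetical non-Comparable key type, for illustration only.
    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    // Hash-only comparison: unequal keys with equal hash codes compare as 0,
    // so a sorted map would treat them as the same key.
    static int hashOnly(Object a, Object b) {
        return Integer.compare(a.hashCode(), b.hashCode());
    }

    // Assumed shape of an equals-first comparison: 0 only for equal keys.
    // On a collision between unequal keys the result is arbitrary (here
    // always -1), which is the remaining ordering weakness discussed above.
    static int equalsThenHash(Object a, Object b) {
        if (a.equals(b)) {
            return 0;
        }
        int c = Integer.compare(a.hashCode(), b.hashCode());
        return c != 0 ? c : -1;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings with the same String hash code.
        Key k1 = new Key("Aa");
        Key k2 = new Key("BB");
        System.out.println("hashOnly:       " + hashOnly(k1, k2));
        System.out.println("equalsThenHash: " + equalsThenHash(k1, k2));
    }
}
```

With the equals-first version, k1 and k2 are never reported as equal, so their values are not merged. The order between them is still arbitrary, which is why interleaving remains possible on a collision.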
> Pair: Integer overflow during comparison can cause inconsistent sort.
> ---------------------------------------------------------------------
>
> Key: CRUNCH-528
> URL: https://issues.apache.org/jira/browse/CRUNCH-528
> Project: Crunch
> Issue Type: Bug
> Components: Core
> Reporter: Brandon Vargo
> Assignee: Josh Wills
> Priority: Minor
> Attachments: 0001-Pair-Fix-comparison-for-large-hash-codes.patch,
> CRUNCH-528.2.patch
>
>
> Pair uses the hash code of each value for comparison if the values are not
> themselves comparable. If the hash code values are too large, then the values
> will wrap when doing subtraction. This results in a comparison function that
> is not transitive.
> Among other things, this makes Joins using the in-memory pipeline not work,
> since the in-memory shuffler uses a TreeMap if the key type is Comparable.
> Since the key in a join is a Pair of the original key and a join tag, the key
> is always comparable. With a non-transitive comparison function, it is
> possible for the two join tags of the original key to sort differently,
> resulting in the two join tags not being adjacent for the original key. This
> results in either the cross product erroneously producing no values in
> the case of an inner join, since the two join tags are not adjacent, or null
> values appearing when they should not in the case of an outer join.
> As a workaround, ensure that the key used in a Join is comparable.
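
The overflow described in the issue can be reproduced with a small Java sketch. This is not the Crunch source: `OverflowCompareSketch`, `bySubtraction`, and `safe` are hypothetical names, and the three integers stand in for hash codes of non-comparable values. It shows that comparing by subtraction is not transitive once the difference wraps, while `Integer.compare` stays consistent.

```java
public class OverflowCompareSketch {
    // Comparison by subtraction: wraps around when the true difference
    // does not fit in a 32-bit int.
    static int bySubtraction(int h1, int h2) {
        return h1 - h2;
    }

    // Overflow-safe comparison of the same two values.
    static int safe(int h1, int h2) {
        return Integer.compare(h1, h2);
    }

    public static void main(String[] args) {
        // Stand-ins for three large-magnitude hash codes.
        int a = -2_000_000_000;
        int b = 0;
        int c = 2_000_000_000;

        // a < b and b < c hold, but a - c wraps to a positive number,
        // so subtraction claims a > c and transitivity is broken.
        System.out.println("a vs b (subtraction): " + Integer.signum(bySubtraction(a, b)));
        System.out.println("b vs c (subtraction): " + Integer.signum(bySubtraction(b, c)));
        System.out.println("a vs c (subtraction): " + Integer.signum(bySubtraction(a, c)));
        System.out.println("a vs c (safe):        " + safe(a, c));
    }
}
```

A TreeMap relies on a consistent ordering from its comparison function, so an inconsistency like this one is enough to place related keys in non-adjacent positions.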