[ 
https://issues.apache.org/jira/browse/CRUNCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561823#comment-14561823
 ] 

Brandon Vargo commented on CRUNCH-528:
--------------------------------------

Ahh, makes sense. I thought you were only going to replace the == with equals 
in the first line of cmp, my apologies.

I could see a case where a join would have interleaving keys in the case of a 
hash collision, due to the arbitrary ordering, so the cross-product would not 
be complete. I went looking for a way to serialize the keys using the 
underlying type (e.g. Writable) in order to break the tie in a more 
deterministic way, but I didn't see an easy way of doing so in a way that makes 
sense. At least now (k1, v1) and (k2, v2) won't get joined together if hash(k1) 
== hash(k2).

Short of there being another tie-breaker that I am missing, the patch looks 
good to me. Thanks!

> Pair: Integer overflow during comparison can cause inconsistent sort.
> ---------------------------------------------------------------------
>
>                 Key: CRUNCH-528
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-528
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Vargo
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: 0001-Pair-Fix-comparison-for-large-hash-codes.patch, 
> CRUNCH-528.2.patch
>
>
> Pair uses the hash code of each value for comparison if the values are not 
> themselves comparable. If the hash code values are too large, then the values 
> will wrap when doing subtraction. This results in a comparison function that 
> is not transitive.
> Among other things, this makes Joins using the in-memory pipeline not work, 
> since the in-memory shuffler uses a TreeMap if the key type is Comparable. 
> Since the key in a join is a Pair of the original key and a join tag, the key 
> is always comparable. With a non-transitive comparison function, it is 
> possible for the two join tags of the original key to sort differently, 
> resulting in the two join tags not being adjacent for the original key. This 
> results either in either the cross product erroneously producing no values in 
> the case of an inner join, since the two join tags are not adjacent, or null 
> values appearing when they should not in the case of an outer join.
> As a workaround, ensure that the key used in a Join is comparable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to