[ https://issues.apache.org/jira/browse/CRUNCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561823#comment-14561823 ]
Brandon Vargo commented on CRUNCH-528: -------------------------------------- Ahh, makes sense. I thought you were only going to replace the == with equals in the first line of cmp, my apologies. I could see a case where a join would have interleaving keys in the case of a hash collision, due to the arbitrary ordering, so the cross-product would not be complete. I went looking for a way to serialize the keys using the underlying type (e.g. Writable) in order to break the tie in a more deterministic way, but I didn't see an easy way of doing so in a way that makes sense. At least now (k1, v1) and (k2, v2) won't get joined together if hash(k1) == hash(k2). Short of there being another tie-breaker that I am missing, the patch looks good to me. Thanks! > Pair: Integer overflow during comparison can cause inconsistent sort. > --------------------------------------------------------------------- > > Key: CRUNCH-528 > URL: https://issues.apache.org/jira/browse/CRUNCH-528 > Project: Crunch > Issue Type: Bug > Components: Core > Reporter: Brandon Vargo > Assignee: Josh Wills > Priority: Minor > Attachments: 0001-Pair-Fix-comparison-for-large-hash-codes.patch, > CRUNCH-528.2.patch > > > Pair uses the hash code of each value for comparison if the values are not > themselves comparable. If the hash code values are too large, then the values > will wrap when doing subtraction. This results in a comparison function that > is not transitive. > Among other things, this makes Joins using the in-memory pipeline not work, > since the in-memory shuffler uses a TreeMap if the key type is Comparable. > Since the key in a join is a Pair of the original key and a join tag, the key > is always comparable. With a non-transitive comparison function, it is > possible for the two join tags of the original key to sort differently, > resulting in the two join tags not being adjacent for the original key. This > results either in either the cross product erroneously producing no values in > the case of an inner join, since the two join tags are not adjacent, or null > values appearing when they should not in the case of an outer join. > As a workaround, ensure that the key used in a Join is comparable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)