This sounds like the perennial problem of hash functions vs. comparators: the two absolutely have to be consistent. If you change the comparator, then you have to change the partitioner to match. This is no different from the fact that changing the equals() method on a Java object may imply a corresponding change to its hashCode().
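For example (just a sketch, with made-up names; newer versions of the mapred API use generics, so adjust to your version): if the grouping comparator only looks at the "user" part of a tab-separated "user\ttimestamp" Text key, then the partitioner has to hash only that part too, something like:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class UserPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) { }

    public int getPartition(Text key, Text value, int numPartitions) {
      // Hash only the part of the key that the grouping comparator looks at,
      // so keys the comparator treats as equal always go to the same reducer.
      String user = key.toString().split("\t", 2)[0];
      // Mask the sign bit so the result is a valid partition index.
      return (user.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }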
But, yes, it is a really easy way to get some very subtle bugs.

On 10/15/07 7:10 AM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:

> Hi all--
>
> I had a subtle bug in a MapReduce job I was working on, related to the
> fact that my custom key type's hash function (used by the default
> HashPartitioner) was putting elements that belonged together according
> to a custom output value grouping comparator into different bins.
> Currently, I've found a single hash function that hashes everything
> correctly for all the comparators I'm using without too many
> collisions. But this will not be generally possible for all
> applications, so I wanted to ask what the best practice here should
> be. The only real possibility I currently see with the API as it is
> (I may be mistaken) is to write a custom Partitioner. Did I miss
> something?
>
> I'd like to ask whether this might be a design bug. For the default
> partitioner (HashPartitioner), there is a dependency between the
> hashCode() of the key type and the compare function being used on the
> keys. The problem is that it is possible to override the compare
> function by specifying a custom comparator, but it is not possible to
> override the hashCode function. This basically means that any time you
> specify a custom comparator, you need to change your partitioner so
> that you can effectively override the hashCode, albeit indirectly.
> This is irritating, and if true, the API doesn't give any indication
> that you should do this, e.g., by providing a single
> setOutputValueComparatorAndPartitioner function instead of the two
> separate functions.
>
> Has anyone else encountered this?
>
> Thanks,
> Chris
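For anyone else who hits this: a grouping comparator matching the partitioner sketch above, and the JobConf wiring that has to be kept consistent by hand, might look roughly like this (UserComparator, UserPartitioner, and MyJob are hypothetical names, not anything in the Hadoop API):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class UserComparator extends WritableComparator {
    public UserComparator() {
      super(Text.class, true);
    }

    public int compare(WritableComparable a, WritableComparable b) {
      // Group on the "user" part only; keys with the same user compare equal.
      String userA = a.toString().split("\t", 2)[0];
      String userB = b.toString().split("\t", 2)[0];
      return userA.compareTo(userB);
    }
  }

and in the job setup:

  JobConf conf = new JobConf(MyJob.class);  // MyJob = whatever your job class is
  // These two settings have to stay consistent: the grouping comparator decides
  // which keys are reduced together, the partitioner decides which reduce task
  // they are sent to.
  conf.setOutputValueGroupingComparator(UserComparator.class);
  conf.setPartitionerClass(UserPartitioner.class);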
