Pony,

Keys have to be compared by the MR framework somehow, and when you use Writables it does so by requiring that your key be of a type that is both Writable and Comparable (WritableComparable).
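As a minimal sketch of what such a key looks like: in Hadoop you would declare `class EventKey implements WritableComparable<EventKey>`, which combines exactly the three methods below. This pure-JDK version (the `EventKey` name and its fields are made up for illustration) keeps the same shape without needing the Hadoop jars:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical key type. In Hadoop this would be
// `class EventKey implements WritableComparable<EventKey>`:
// write/readFields come from Writable, compareTo from Comparable.
class EventKey implements Comparable<EventKey> {
    long timestamp;
    int shard;

    // Writable.write: serialize the key's fields.
    void write(DataOutputStream out) throws IOException {
        out.writeLong(timestamp);
        out.writeInt(shard);
    }

    // Writable.readFields: deserialize the fields in the same order.
    void readFields(DataInputStream in) throws IOException {
        timestamp = in.readLong();
        shard = in.readInt();
    }

    // Comparable.compareTo: the ordering the sort phase applies to keys.
    @Override
    public int compareTo(EventKey other) {
        int c = Long.compare(timestamp, other.timestamp);
        return (c != 0) ? c : Integer.compare(shard, other.shard);
    }
}

public class KeySketch {
    public static void main(String[] args) throws IOException {
        EventKey a = new EventKey();
        a.timestamp = 100L;
        a.shard = 2;

        // Round-trip through the Writable-style serialization.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        a.write(new DataOutputStream(buf));
        EventKey b = new EventKey();
        b.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(a.compareTo(b)); // same fields -> 0
    }
}
```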
If you specify a comparator class, that comparator will be used; otherwise the default WritableComparator will be asked whether it can supply a comparator for your key type. AFAIK, the default WritableComparator implements RawComparator but does indeed deserialize the writables before applying the compare operation. RawComparator's primary idea is to hand you the two raw byte sequences so you can compare them directly. Certain other serialization libraries (Apache Avro is one) provide ways to compare using the bytes themselves (across different types), which can end up being faster when used in jobs.

Hope this clears up your confusion.

On Tue, May 24, 2011 at 2:06 AM, Juan P. <[email protected]> wrote:
> Hi guys,
> I wanted to get your help with a couple of questions which came up while
> looking at the Hadoop Comparator/Comparable architecture.
>
> As I see it, before each reducer operates on each key, a sorting algorithm
> is applied to them. *Why does Hadoop need to do that?*
>
> If I implement my own class and I intend to use it as a key, I must allow
> instances of my class to be compared. So I have 2 choices: I can implement
> WritableComparable or I can register a WritableComparator for my class.
> Should I fail to do either, would the job fail?
> If I register a WritableComparator which does not use the Comparable
> interface at all, does my key need to implement WritableComparable?
> If I don't implement my Comparator and my key implements WritableComparable,
> does it mean that Hadoop will deserialize my keys twice? (once for sorting,
> and once for reducing)
> What is RawComparable used for?
>
> Thanks for your help!
> Pony

--
Harsh J
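The raw-comparison idea from the reply can be sketched as follows. In Hadoop you would subclass WritableComparator and override compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2); this pure-JDK version (the class name and main() are made up for illustration) shows the core of it, in the spirit of WritableComparator.compareBytes():

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: order two serialized keys by their bytes without deserializing
// them. The class name is hypothetical; in Hadoop this logic lives in a
// WritableComparator subclass registered for the key type.
final class RawEventKeyComparator {
    // Lexicographic comparison of unsigned bytes over the given ranges.
    static int compareBytes(byte[] b1, int s1, int l1,
                            byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;   // treat each byte as unsigned
            int y = b2[s2 + i] & 0xff;
            if (x != y) return x - y;
        }
        return l1 - l2;                  // shorter sequence sorts first
    }

    public static void main(String[] args) throws IOException {
        // DataOutput.writeLong is big-endian, so for non-negative longs
        // byte order matches numeric order; negative values would need
        // extra handling.
        ByteArrayOutputStream o1 = new ByteArrayOutputStream();
        new DataOutputStream(o1).writeLong(5L);
        ByteArrayOutputStream o2 = new ByteArrayOutputStream();
        new DataOutputStream(o2).writeLong(9L);
        byte[] b1 = o1.toByteArray();
        byte[] b2 = o2.toByteArray();
        System.out.println(
            compareBytes(b1, 0, b1.length, b2, 0, b2.length) < 0); // true
    }
}
```

This is what makes raw comparators fast: the sort never pays the cost of instantiating and populating key objects just to order them.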
