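For a ValueGrouping comparator to work, your Partitioner must act in tandem with it: it has to partition on the same field the comparator groups by, so that every key in a group reaches the same reducer. I do not know if you have implemented a custom hashCode() method for your key class, but the body of your partitioner's getPartition() should look like: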

    return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;

This will ensure that the to-be-grouped data is actually partitioned properly too.

The actual sorting (which ought to occur for the full composite key, field by field, and is the only real 'sorter') would be handled by the compareTo() method of your key class, if you are using a WritableComparable.

On Thu, Feb 3, 2011 at 10:51 PM, Marco Didonna <[email protected]> wrote:
> Hello,
> I am writing a little hadoop program to index a bunch (large bunch) of
> text files joined together in a large xml file. The mapper execute some
> basic text preprocessing and emits key-value pair like:
>
> (term,document_id) -> (section_of_the_document,positional frequency vector)
>
> example
>
> (apple,12) -> (title,[1,3])
>
> The reducer should bring together the same terms and create a posting
> list like:
>
> apple -> (12,title,[1,3]) , (14,body,[2,5]) ...
>
> ... -> ...
>
> To accomplish this I have created a custom class PairOfStringInt to hold
> mapper's key which implements writableComparable, a custom partitioner
> TermPartioner (https://gist.github.com/809793) and a Reducer which
> should bring all values from the same key[1] into the same posting list
> as in the example.
>
> Testing my system on a tiny dataset made up of two document (same
> content) I get:
>
> minni [(1,body,[1,2])]
> pippo [(1,body,[2,0,3])]
> pluto [(1,body,[1,1])]
> minni [(2,body,[1,2])]
> pippo [(2,body,[1,0])]
> pluto [(2,body,[1,1])]
>
> The values from the same key are not brought together...Looking at the
> secondary sort example I also tried to implement a
> GroupComparator(https://gist.github.com/809803) to be set on the job
> using job.setGroupingComparatorClass(GroupingComparator.class) but if I
> do so I get in the output:
>
> minni
> [(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]
>
>
> One single key (the first one) and all postings associated with
> it...what do I miss??
>
> Thanks for your time
>
> Marco
>
> [1] by "same key" I mean those who have the same left element
>

--
Harsh J
www.harshj.com
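
For illustration only, here is a minimal plain-Java sketch of how the three pieces discussed in this thread fit together: partitioning on the left element, sorting on the full composite key, and grouping on the left element again. It does not use any Hadoop classes. TermDoc, partition, SORT, GROUP and reduceGroups are hypothetical names made up for this sketch (TermDoc stands in for PairOfStringInt), and reduceGroups only imitates what a single reducer does with its sorted keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    /** Hypothetical stand-in for the composite key (term, document_id). */
    static final class TermDoc {
        final String term;
        final int docId;

        TermDoc(String term, int docId) {
            this.term = term;
            this.docId = docId;
        }
    }

    /** Partition on the left element only, mirroring the one-line partitioner body. */
    static int partition(TermDoc key, int numPartitions) {
        return (key.term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sort order: the full composite key, field by field. */
    static final Comparator<TermDoc> SORT =
            Comparator.comparing((TermDoc k) -> k.term).thenComparingInt(k -> k.docId);

    /** Grouping order: left element only, so (apple,12) and (apple,14) fall in one group. */
    static final Comparator<TermDoc> GROUP = Comparator.comparing((TermDoc k) -> k.term);

    /** Imitates one reducer: sort the keys, then open a new group whenever GROUP reports a change. */
    static Map<String, List<Integer>> reduceGroups(List<TermDoc> keys) {
        List<TermDoc> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        TermDoc previous = null;
        List<Integer> current = null;
        for (TermDoc key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.term, current);
            }
            current.add(key.docId);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<TermDoc> keys = Arrays.asList(
                new TermDoc("pippo", 2), new TermDoc("minni", 1),
                new TermDoc("pluto", 1), new TermDoc("minni", 2),
                new TermDoc("pippo", 1), new TermDoc("pluto", 2));
        // Prints {minni=[1, 2], pippo=[1, 2], pluto=[1, 2]}
        System.out.println(reduceGroups(keys));
    }
}
```

In this sketch a new group starts only when GROUP reports a difference between neighbouring keys, so a grouping comparator that treated every pair as equal would fold all postings under the first key. Whether that is what happened with the comparator in the gist cannot be told from this message alone.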
