For a ValueGrouping comparator to work, your Partitioner must act in
tandem with it. I do not know if you have implemented a custom
hashCode() method for your Key class, but your partitioner should look
like:

return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numPartitions;

This will ensure that the to-be grouped data is actually partitioned
properly too.

The actual sorting (which ought to occur for the full composite key
field-by-field, and is the only real 'sorter') would be handled by the
compare() call of your Writable, if you are using a
WritableComparable.

On Thu, Feb 3, 2011 at 10:51 PM, Marco Didonna <[email protected]> wrote:
> Hello,
> I am writing a little hadoop program to index a bunch (large bunch) of
> text files joined together in a large xml file. The mapper execute some
> basic text preprocessing and emits key-value pair like:
>
> (term,document_id) -> (section_of_the_document,positional frequency vector)
>
> example
>
> (apple,12) -> (title,[1,3])
>
> The reducer should bring together the same terms and create a posting
> list like:
>
> apple -> (12,title,[1,3]) , (14,body,[2,5]) ...
>
> ... -> ...
>
> To accomplish this I have created a custom class PairOfStringInt to hold
> mapper's key which implements writableComparable, a custom partitioner
> TermPartioner (https://gist.github.com/809793) and a Reducer which
> should bring all values from the same key[1] into the same posting list
> as in the example.
>
> Testing my system on a tiny dataset made up of two document (same
> content) I get:
>
> minni   [(1,body,[1,2])]
> pippo   [(1,body,[2,0,3])]
> pluto   [(1,body,[1,1])]
> minni   [(2,body,[1,2])]
> pippo   [(2,body,[1,0])]
> pluto   [(2,body,[1,1])]
>
> The values from the same key are not brought together...Looking at the
> secondary sort example I also tried to implement a
> GroupComparator(https://gist.github.com/809803) to be set on the job
> using job.setGroupingComparatorClass(GroupingComparator.class) but if I
> do so I get in the output:
>
> minni
> [(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]
>
>
> One single key (the first one) and all postings associated with
> it...what do I miss??
>
> Thanks for your time
>
> Marco
>
> [1] by "same key" I mean those who have the same left element
>
>



-- 
Harsh J
www.harshj.com

Reply via email to