Hello, I am writing a small Hadoop program to index a large collection of text files joined together in one big XML file. The mapper performs some basic text preprocessing and emits key-value pairs of the form:
(term, document_id) -> (section_of_the_document, positional_frequency_vector)

for example: (apple, 12) -> (title, [1,3])

The reducer should bring together the pairs with the same term and build a posting list such as:

apple -> (12, title, [1,3]), (14, body, [2,5]) ...
...   -> ...

To accomplish this I created a custom key class, PairOfStringInt, which implements WritableComparable, a custom partitioner, TermPartioner (https://gist.github.com/809793), and a reducer that should merge all values with the same key[1] into one posting list as in the example above.

Testing the system on a tiny dataset of two documents (same content) I get:

minni [(1,body,[1,2])]
pippo [(1,body,[2,0,3])]
pluto [(1,body,[1,1])]
minni [(2,body,[1,2])]
pippo [(2,body,[1,0])]
pluto [(2,body,[1,1])]

The values for the same term are not brought together. Following the secondary-sort example, I also tried implementing a GroupComparator (https://gist.github.com/809803) and setting it on the job with job.setGroupingComparatorClass(GroupingComparator.class), but then the output becomes:

minni [(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]

i.e. a single key (the first one) with every posting attached to it. What am I missing?

Thanks for your time,
Marco

[1] By "same key" I mean keys that share the same left element (the term).
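Addendum: to make the intended semantics concrete, here is a minimal plain-Java sketch of the behavior I am trying to get. The Pair class is a hypothetical stand-in for my PairOfStringInt (the gists hold the real Hadoop code): the sort comparator should order keys by (term, docId), while the grouping comparator should compare the term only, so that all docIds for one term reach the same reduce() call.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical plain-Java stand-in for my PairOfStringInt key (term, docId);
// in the real job this implements Hadoop's WritableComparable instead.
class Pair {
    final String term;
    final int docId;
    Pair(String term, int docId) { this.term = term; this.docId = docId; }
    public String toString() { return "(" + term + "," + docId + ")"; }
}

public class GroupingDemo {
    // Sort comparator: order by term first, then by docId, so postings
    // for one term arrive at the reducer in docId order.
    static final Comparator<Pair> SORT =
        Comparator.comparing((Pair p) -> p.term).thenComparingInt(p -> p.docId);

    // Grouping comparator: compare the term ONLY, so every (term, docId)
    // key with the same term falls into a single reduce() call.
    static final Comparator<Pair> GROUP =
        Comparator.comparing(p -> p.term);

    public static void main(String[] args) {
        Pair[] keys = {
            new Pair("pippo", 2), new Pair("minni", 1),
            new Pair("pippo", 1), new Pair("minni", 2)
        };
        Arrays.sort(keys, SORT);
        // Walk the sorted keys and start a new "reduce group" whenever
        // the grouping comparator reports that the term changed.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0 && GROUP.compare(keys[i - 1], keys[i]) != 0) out.append(" | ");
            out.append(keys[i]);
        }
        System.out.println(out);  // (minni,1)(minni,2) | (pippo,1)(pippo,2)
    }
}
```

With these semantics I would expect one reduce() call per term, receiving every (docId, section, positions) value for that term.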
