Hello, I am writing a small Hadoop program to index a large collection of text files joined together in one big XML file. The mapper performs some basic text preprocessing and emits key-value pairs of the form:
(term, document_id) -> (section_of_the_document, positional_frequency_vector)

for example: (apple, 12) -> (title, [1,3])

The reducer should bring together the pairs with the same term and build a posting list such as:

apple -> (12, title, [1,3]), (14, body, [2,5]) ...
...   -> ...

To accomplish this I created a custom key class, PairOfStringInt, which implements WritableComparable, a custom partitioner, TermPartioner (https://gist.github.com/809793), and a reducer that should merge all values with the same key[1] into one posting list as in the example above.

Testing the system on a tiny dataset of two documents (same content) I get:

minni [(1,body,[1,2])]
pippo [(1,body,[2,0,3])]
pluto [(1,body,[1,1])]
minni [(2,body,[1,2])]
pippo [(2,body,[1,0])]
pluto [(2,body,[1,1])]

The values for the same term are not brought together. Following the secondary-sort example, I also tried implementing a GroupComparator (https://gist.github.com/809803) and setting it on the job with job.setGroupingComparatorClass(GroupingComparator.class), but then the output becomes:

minni [(1,body,[1,2])],[(1,body,[2,0,3])],[(1,body,[1,1])],[(2,body,[1,2])],[(2,body,[1,0])],[(2,body,[1,1])]

i.e. a single key (the first one) with every posting attached to it. What am I missing?

Thanks for your time,
Marco

[1] By "same key" I mean keys that share the same left element (the term).
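Addendum: to make the intended semantics concrete, here is a minimal plain-Java sketch of the behavior I am trying to get. The Pair class is a hypothetical stand-in for my PairOfStringInt (the gists hold the real Hadoop code): the sort comparator should order keys by (term, docId), while the grouping comparator should compare the term only, so that all docIds for one term reach the same reduce() call.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical plain-Java stand-in for my PairOfStringInt key (term, docId);
// in the real job this implements Hadoop's WritableComparable instead.
class Pair {
    final String term;
    final int docId;
    Pair(String term, int docId) { this.term = term; this.docId = docId; }
    public String toString() { return "(" + term + "," + docId + ")"; }
}

public class GroupingDemo {
    // Sort comparator: order by term first, then by docId, so postings
    // for one term arrive at the reducer in docId order.
    static final Comparator<Pair> SORT =
        Comparator.comparing((Pair p) -> p.term).thenComparingInt(p -> p.docId);

    // Grouping comparator: compare the term ONLY, so every (term, docId)
    // key with the same term falls into a single reduce() call.
    static final Comparator<Pair> GROUP =
        Comparator.comparing(p -> p.term);

    public static void main(String[] args) {
        Pair[] keys = {
            new Pair("pippo", 2), new Pair("minni", 1),
            new Pair("pippo", 1), new Pair("minni", 2)
        };
        Arrays.sort(keys, SORT);
        // Walk the sorted keys and start a new "reduce group" whenever
        // the grouping comparator reports that the term changed.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0 && GROUP.compare(keys[i - 1], keys[i]) != 0) out.append(" | ");
            out.append(keys[i]);
        }
        System.out.println(out);  // (minni,1)(minni,2) | (pippo,1)(pippo,2)
    }
}
```

With these semantics I would expect one reduce() call per term, receiving every (docId, section, positions) value for that term.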
