Having an issue with some SequenceFiles that I generated, and I'm trying
to write an M/R job to fix them.
Situation is roughly this:
I have a bunch of directories in HDFS, each of which contains a set of 7
sequence files. Each sequence file is of a different "type", but the
key type is the same across all of them. The value types are also
identical in their compressed form (i.e., BytesWritable), though the
different record types are obviously different once uncompressed.
I want to write a job to fix some problems in the files. My thinking is
that I can feed all the data from all the files into an M/R job (i.e.,
gather), re-sort/partition the data properly, perform some additional
cleanup/fixup in the reducer, and then write the data back out to a new
set of files (i.e., scatter).
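To make that shape concrete, here's a minimal, self-contained Java
sketch of the gather -> sort -> scatter flow I have in mind. Plain
collections stand in for the M/R machinery; the Rec type and the method
names are placeholders of mine, not Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GatherScatterSketch {

    // One record as it comes off a source sequence file: the shared key
    // type plus an opaque compressed payload, remembering which of the
    // seven file types it came from.
    public record Rec(String key, String sourceType, byte[] payload) {}

    // Stand-in for the shuffle/sort the framework performs: group the
    // gathered records by key, with the keys kept in sorted order.
    public static TreeMap<String, List<Rec>> sortByKey(List<Rec> input) {
        TreeMap<String, List<Rec>> sorted = new TreeMap<>();
        for (Rec r : input) {
            sorted.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }
        return sorted;
    }

    // Stand-in for the reducer's scatter step: after any fixup, each
    // record goes back out to an output bucket chosen by the type of
    // file it originally came from.
    public static Map<String, List<Rec>> scatterByType(TreeMap<String, List<Rec>> sorted) {
        Map<String, List<Rec>> outputs = new TreeMap<>();
        for (List<Rec> group : sorted.values()) {
            for (Rec r : group) {
                // (the real job would uncompress, fix, and recompress here)
                outputs.computeIfAbsent(r.sourceType(), t -> new ArrayList<>()).add(r);
            }
        }
        return outputs;
    }
}
```

The sticking point is where sourceType comes from in the real job --
that's exactly what I describe below.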
I've been digging through the APIs, and it looks like CombineFileInputFormat
/ CombineFileRecordReader might be the way to go here. It would let me
merge the whole load of data from each of the (small) files into one M/R
job efficiently.
Sorting would then occur by key, as would partitioning, so I'm still
good so far.
The problem, however, comes when I get to the reducer. The reducer needs to
know which type of file data (i.e., which type of source file) a record
came from so that it can a) uncompress/deserialize the data correctly,
and b) scatter it out to the correct type of output file.
I'm not entirely clear how to make this happen. It seems like the
source file information (which looks like it might exist on the
CombineFileSplit) is no longer available by the time it gets to the
reducer. And if the reducer doesn't know which file a given record came
from, it won't know how to process it properly.
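The only workaround I've thought of so far (just an idea of mine,
nothing I've confirmed against the Hadoop docs) is to make each record
self-describing: have the record reader or mapper prepend a one-byte
type id, derived from the source file name, to the compressed value
bytes, and have the reducer strip it back off to decide how to
deserialize the record and which output to route it to. A plain-Java
sketch of that tag/untag scheme, independent of the Hadoop types:

```java
import java.util.Arrays;

public class TypeTaggedValue {

    // Map side: prepend a one-byte type id to the raw (compressed)
    // value bytes before they go into the shuffle.
    public static byte[] tag(byte typeId, byte[] payload) {
        byte[] tagged = new byte[payload.length + 1];
        tagged[0] = typeId;
        System.arraycopy(payload, 0, tagged, 1, payload.length);
        return tagged;
    }

    // Reduce side: recover the type id so the record can be
    // deserialized correctly and routed to the right output file.
    public static byte typeOf(byte[] tagged) {
        return tagged[0];
    }

    // Reduce side: recover the original compressed payload.
    public static byte[] payloadOf(byte[] tagged) {
        return Arrays.copyOfRange(tagged, 1, tagged.length);
    }
}
```

That costs an extra byte per record through the shuffle, though, so if
there's a cleaner way to surface the source file on the reduce side,
I'd rather use that.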
Can anyone lend some suggestions on how to code this solution? Am I on
the right track with the CombineFileInputFormat /
CombineFileRecordReader approach? If so, then how might I make the
reducer code aware of the source of the record(s) it's currently processing?
TIA!
DR