I'm having an issue with some SequenceFiles that I generated, and I'm trying to write an M/R job to fix them.

Situation is roughly this:

I have a bunch of directories in HDFS, each of which contains a set of 7 sequence files. Each sequence file is of a different "type", but the key type is the same across all of the sequence files. The value types - which are compressed - are also the same when in compressed form (i.e., BytesWritable), though the different record types are obviously different when uncompressed.

I want to write a job to fix some problems in the files. My thinking is that I can feed all the data from all the files into a M/R job (i.e., gather), re-sort/partition the data properly, perform some additional cleanup/fixup in the reducer, and then write the data back out to a new set of files (i.e., scatter).


I've been digging through the APIs, and it looks like CombineFileInputFormat / CombineFileRecordReader might be the way to go here. It'd let me merge the whole load of data from each of the (small) files into one M/R job in an efficient way.

Sorting would then occur by key, as would partitioning, so I'm still good so far.

The problem, however, comes at the reducer. The reducer needs to know which type of source file a record came from so that it can a) uncompress/deserialize the data correctly, and b) scatter it out to the correct type of output file.

I'm not entirely clear how to make this happen. It seems like the source file information (which looks like it might exist on the CombineFileSplit) is no longer available by the time it gets to the reducer. And if the reducer doesn't know which file a given record came from, it won't know how to process it properly.
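One pattern that might work here (sketched below without any Hadoop dependencies, so the class and method names are my own invention, not Hadoop API): have the custom CombineFileRecordReader look up which source file it is reading (CombineFileSplit does expose the paths via getPath(index)) and prepend a one-byte type tag to each value's bytes on the map side. The tag then travels with the record through the shuffle, and the reducer reads it back to pick the right deserializer and the right output. A minimal illustration of the tag/untag step itself:

```java
import java.util.Arrays;

public class TaggedValue {
    // Map side: prepend a one-byte type id (e.g., 0..6 for the 7 sequence-file
    // types) to the compressed value bytes before emitting them.
    static byte[] tag(byte typeId, byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = typeId;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }

    // Reduce side: read the tag to decide how to deserialize and where to write...
    static byte typeOf(byte[] tagged) {
        return tagged[0];
    }

    // ...and strip it to recover the original compressed bytes untouched.
    static byte[] payloadOf(byte[] tagged) {
        return Arrays.copyOfRange(tagged, 1, tagged.length);
    }

    public static void main(String[] args) {
        byte[] compressed = {10, 20, 30};          // stand-in for a compressed record
        byte[] tagged = tag((byte) 3, compressed); // mapper marks the record as type 3
        System.out.println("type=" + typeOf(tagged)
                + " payload=" + Arrays.toString(payloadOf(tagged)));
    }
}
```

In the real job the tagged bytes would ride inside the BytesWritable value, and the reducer could switch on the tag to route each record to a per-type output (MultipleOutputs is the usual tool for that scatter step). This is just a sketch of the idea, not a tested Hadoop implementation.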

Can anyone lend some suggestions on how to code this solution? Am I on the right track with the CombineFileInputFormat / CombineFileRecordReader approach? If so, then how might I make the reducer code aware of the source of the record(s) it's currently processing?

TIA!

DR