Having an issue with some SequenceFiles that I generated, and I'm trying
to write an M/R job to fix them.
Situation is roughly this:
I have a bunch of directories in HDFS, each of which contains a set of 7
sequence files. Each sequence file is of a different "type", but the
key type is the same across all of them. The value types are also
identical in their compressed form (i.e., BytesWritable), though the
different record types are obviously different once uncompressed.
I want to write a job to fix some problems in the files. My thinking is
that I can feed all the data from all the files into an M/R job (i.e.,
gather), re-sort/partition the data properly, perform some additional
cleanup/fixup in the reducer, and then write the data back out to a new
set of files (i.e., scatter).
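To make that shape concrete, here's a minimal, self-contained Java
sketch of the gather -> sort -> scatter flow I have in mind. Plain
collections stand in for the M/R machinery; the Rec type and the method
names are placeholders of mine, not Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GatherScatterSketch {

    // One record as it comes off a source sequence file: the shared key
    // type plus an opaque compressed payload, remembering which of the
    // seven file types it came from.
    public record Rec(String key, String sourceType, byte[] payload) {}

    // Stand-in for the shuffle/sort the framework performs: group the
    // gathered records by key, with the keys kept in sorted order.
    public static TreeMap<String, List<Rec>> sortByKey(List<Rec> input) {
        TreeMap<String, List<Rec>> sorted = new TreeMap<>();
        for (Rec r : input) {
            sorted.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }
        return sorted;
    }

    // Stand-in for the reducer's scatter step: after any fixup, each
    // record goes back out to an output bucket chosen by the type of
    // file it originally came from.
    public static Map<String, List<Rec>> scatterByType(TreeMap<String, List<Rec>> sorted) {
        Map<String, List<Rec>> outputs = new TreeMap<>();
        for (List<Rec> group : sorted.values()) {
            for (Rec r : group) {
                // (the real job would uncompress, fix, and recompress here)
                outputs.computeIfAbsent(r.sourceType(), t -> new ArrayList<>()).add(r);
            }
        }
        return outputs;
    }
}
```

The sticking point is where sourceType comes from in the real job --
that's exactly what I describe below.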
I've been digging through the APIs, and it looks like CombineFileInputFormat
/ CombineFileRecordReader might be the way to go here. It would let me
merge the whole load of data from each of the (small) files into one M/R
job efficiently.
Sorting would then occur by key, as would partitioning, so I'm still
good so far.
The problem, however, comes when I get to the reducer. The reducer needs to
know which type of file data (i.e., which type of source file) a record
came from so that it can a) uncompress/deserialize the data correctly,
and b) scatter it out to the correct type of output file.
I'm not entirely clear how to make this happen. It seems like the
source file information (which looks like it might exist on the
CombineFileSplit) is no longer available by the time it gets to the
reducer. And if the reducer doesn't know which file a given record came
from, it won't know how to process it properly.
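The only workaround I've thought of so far (just an idea of mine,
nothing I've confirmed against the Hadoop docs) is to make each record
self-describing: have the record reader or mapper prepend a one-byte
type id, derived from the source file name, to the compressed value
bytes, and have the reducer strip it back off to decide how to
deserialize the record and which output to route it to. A plain-Java
sketch of that tag/untag scheme, independent of the Hadoop types:

```java
import java.util.Arrays;

public class TypeTaggedValue {

    // Map side: prepend a one-byte type id to the raw (compressed)
    // value bytes before they go into the shuffle.
    public static byte[] tag(byte typeId, byte[] payload) {
        byte[] tagged = new byte[payload.length + 1];
        tagged[0] = typeId;
        System.arraycopy(payload, 0, tagged, 1, payload.length);
        return tagged;
    }

    // Reduce side: recover the type id so the record can be
    // deserialized correctly and routed to the right output file.
    public static byte typeOf(byte[] tagged) {
        return tagged[0];
    }

    // Reduce side: recover the original compressed payload.
    public static byte[] payloadOf(byte[] tagged) {
        return Arrays.copyOfRange(tagged, 1, tagged.length);
    }
}
```

That costs an extra byte per record through the shuffle, though, so if
there's a cleaner way to surface the source file on the reduce side,
I'd rather use that.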
Can anyone lend some suggestions on how to code this solution? Am I on
the right track with the CombineFileInputFormat /
CombineFileRecordReader approach? If so, then how might I make the
reducer code aware of the source of the record(s) it's currently processing?
TIA!
DR