Hmm, how about modifying the key that you collect in the mapper to include some *additional* information (like the filename) that tells the reducer where each record came from?
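
A rough sketch of what I mean, in plain Java with no Hadoop dependencies so it runs on its own. `TaggedKey`, `encode`, `decode` and the tab separator are names and choices I made up for illustration; in a real job the mapper would have to derive the source type itself (for example from the path of the file it is currently reading), and I'm assuming the separator never appears in a real key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Hadoop class): carries the record's origin
// alongside the original key so the reducer can recover it.
final class TaggedKey implements Comparable<TaggedKey> {
    // Assumption: a tab never occurs inside a real key or source type.
    private static final char SEP = '\t';

    final String key;        // the original record key
    final String sourceType; // e.g. derived from the input file name

    TaggedKey(String key, String sourceType) {
        this.key = key;
        this.sourceType = sourceType;
    }

    // What the mapper would emit as its output key.
    String encode() {
        return key + SEP + sourceType;
    }

    // What the reducer would call to split the key back into its parts.
    static TaggedKey decode(String encoded) {
        int i = encoded.lastIndexOf(SEP);
        if (i < 0) {
            throw new IllegalArgumentException("no source tag in key: " + encoded);
        }
        return new TaggedKey(encoded.substring(0, i), encoded.substring(i + 1));
    }

    // Order by the original key first, then by source type, so records
    // for the same original key stay next to each other after sorting.
    @Override
    public int compareTo(TaggedKey other) {
        int c = key.compareTo(other.key);
        return c != 0 ? c : sourceType.compareTo(other.sourceType);
    }

    public static void main(String[] args) {
        List<TaggedKey> keys = new ArrayList<>();
        keys.add(new TaggedKey("user42", "views"));
        keys.add(new TaggedKey("user07", "views"));
        keys.add(new TaggedKey("user42", "clicks"));
        Collections.sort(keys);
        for (TaggedKey k : keys) {
            TaggedKey back = decode(k.encode());
            System.out.println(back.key + " came from " + back.sourceType);
        }
    }
}
```

One thing to watch: once the tag is part of the key, the default hash partitioning would look at the whole tagged key, so if you need all records for one original key to land on the same reducer you would probably want a custom Partitioner that hashes only the original key part.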
-Shrijeet

On Tue, Dec 7, 2010 at 3:43 PM, David Rosenstrauch <[email protected]> wrote:
> Having an issue with some SequenceFiles that I generated, and I'm trying to
> write a M/R job to fix them.
>
> Situation is roughly this:
>
> I have a bunch of directories in HDFS, each of which contains a set of 7
> sequence files. Each sequence file is of a different "type", but the key
> type is the same across all of the sequence files. The value types - which
> are compressed - are also the same when in compressed form (i.e.,
> BytesWritable), though the different record types are obviously different
> when uncompressed.
>
> I want to write a job to fix some problems in the files. My thinking is
> that I can feed all the data from all the files into a M/R job (i.e.,
> gather), re-sort/partition the data properly, perform some additional
> cleanup/fixup in the reducer, and then write the data back out to a new set
> of files (i.e., scatter).
>
>
> Been digging through the API's, and it looks like CombineFileInputFormat /
> CombineFileRecordReader might be the way to go here. It'd let me merge the
> whole load of data from each of the (small) files into one M/R job in an
> efficient way.
>
> Sorting would then occur by key, as would partitioning, so I'm still good so
> far.
>
> Problem, however, is when I get to the reducer. The reducer needs to know
> which type of file data (i.e., which type of source file) a record came from
> so that it can a) uncompress/deserialize the data correctly, and b) scatter
> it out to the correct type of output file.
>
> I'm not entirely clear how to make this happen. It seems like the source
> file information (which looks like it might exist on the CombineFileSplit)
> is no longer available by the time it gets to the reducer. And if the
> reducer doesn't know which file a given record came from, it won't know how
> to process it properly.
>
> Can anyone lend some suggestions on how to code this solution? Am I on the
> right track with the CombineFileInputFormat / CombineFileRecordReader
> approach? If so, then how might I make the reducer code aware of the source
> of the record(s) it's currently processing?
>
> TIA!
>
> DR
>
