Seems like CombineFileInputFormat.createPool() might help here. But I'm
a little unclear on the usage. That method is protected ... so I guess
it's only accessible from subclasses?
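Roughly what I'm picturing, if subclassing is indeed the intended usage
(completely untested sketch; assumes the new org.apache.hadoop.mapreduce
API, where createPool() takes PathFilters directly rather than a JobConf;
the Text key type, the file naming scheme, and PerFileReader - the
hypothetical per-file reader from my mail below - are all placeholders):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MyCombineInputFormat
        extends CombineFileInputFormat<Text, BytesWritable> {

    public MyCombineInputFormat() {
        // createPool() is protected, so we call it from the subclass
        // constructor. Files accepted by a pool's filter are only ever
        // combined with other files from the same pool.
        // One pool per file "type" (hypothetical naming scheme):
        for (final String type : new String[] { "typeA", "typeB" /* ... */ }) {
            createPool(new PathFilter() {
                public boolean accept(Path path) {
                    return path.getName().startsWith(type);
                }
            });
        }
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader delegates to one per-file reader
        // instance per path in the split.
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, PerFileReader.class);
    }
}

Does that look about right?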
Can anyone advise on usage here?
Thanks,
DR
On 12/08/2010 11:25 AM, David Rosenstrauch wrote:
Bit of a snag here:
Since I'm thinking this app needs to use CombineFileInputFormat (lots of
small files), this throws a wrench into the plans a bit.
CombineFileInputFormat creates CombineFileSplits, not FileSplits. And a
CombineFileSplit only contains the list of all the file paths whose data
is included in the split, with no way to identify which file path a
particular record came from.
Any workaround here?
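The only lead I see so far: CombineFileRecordReader constructs its
per-file delegate reader through a (CombineFileSplit, TaskAttemptContext,
Integer) constructor, where the Integer is the index of that delegate's
path within the split. So the delegate could look up its own path and tag
each record with it. Untested sketch (PerFileReader is a made-up name;
assumes Text keys and that each small file is carried whole in the split):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class PerFileReader extends RecordReader<Text, BytesWritable> {
    private final Path path;    // the one file this delegate reads
    private final long fileLen;
    private SequenceFile.Reader in;
    private Text rawKey = new Text();
    private Text taggedKey = new Text();
    private BytesWritable value = new BytesWritable();

    // CombineFileRecordReader requires exactly this constructor
    // signature; 'index' identifies which of the split's paths is ours.
    public PerFileReader(CombineFileSplit split, TaskAttemptContext context,
                         Integer index) {
        this.path = split.getPath(index);
        this.fileLen = split.getLength(index);
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        Configuration conf = context.getConfiguration();
        // Assumes offset 0 / whole file, which should hold when combining
        // many files that are each below the block size.
        in = new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (!in.next(rawKey, value)) {
            return false;
        }
        // Tag each key with the source file name; the mapper can split
        // the tag back off (or pass it through to the reducer).
        taggedKey.set(path.getName() + "\t" + rawKey.toString());
        return true;
    }

    @Override public Text getCurrentKey() { return taggedKey; }
    @Override public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException {
        return fileLen == 0 ? 1.0f
                : Math.min(1.0f, in.getPosition() / (float) fileLen);
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}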
Thanks,
DR
On 12/07/2010 11:08 PM, David Rosenstrauch wrote:
Thanks for the suggestion Shrijeet.
Same thought occurred to me on the way home from work after I sent this
mail. Not sure why, but my brain was kinda locked onto the concept of the
mapper being a no-op in this situation. Obviously it doesn't have to be.
Let me try hacking this together and see how it goes. Thanks again for
helping clarify my thinking.
DR
On 12/07/2010 07:02 PM, Shrijeet Paliwal wrote:
ammm, how about modifying the key that you collect in the mapper to
include some *additional* information (like the filename) to hint the
reducer about the record's origin?
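Something like this, maybe (rough and untested; assumes plain FileSplits
and Text keys, so with combined splits you'd need to get the file name
some other way):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingMapper
        extends Mapper<Text, BytesWritable, Text, BytesWritable> {

    private String fileName;
    private Text taggedKey = new Text();

    @Override
    protected void setup(Context context) {
        // Each map task reads one split of one file, so grab the source
        // file name once up front.
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Append the file name to the key so the reducer can recover the
        // record's origin. (If sorting/partitioning must still go by the
        // original key alone, you'd pair this with a custom partitioner
        // and grouping comparator that ignore the tag.)
        taggedKey.set(key.toString() + "\t" + fileName);
        context.write(taggedKey, value);
    }
}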
-Shrijeet
On Tue, Dec 7, 2010 at 3:43 PM, David Rosenstrauch <[email protected]> wrote:
Having an issue with some SequenceFiles that I generated, and I'm trying
to write a M/R job to fix them.

Situation is roughly this:

I have a bunch of directories in HDFS, each of which contains a set of 7
sequence files. Each sequence file is of a different "type", but the key
type is the same across all of the sequence files. The value types -
which are compressed - are also the same when in compressed form (i.e.,
BytesWritable), though the different record types are obviously different
when uncompressed.
I want to write a job to fix some problems in the files. My thinking is
that I can feed all the data from all the files into a M/R job (i.e.,
gather), re-sort/partition the data properly, perform some additional
cleanup/fixup in the reducer, and then write the data back out to a new
set of files (i.e., scatter).
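For the scatter side, I'm guessing something like MultipleOutputs could
route each record type to its own set of output files. A sketch of the
driver setup I have in mind (assuming the new-API MultipleOutputs is
available in this Hadoop version; "typeA" etc. are placeholder names for
the 7 file types):

// Driver-side setup (sketch): register one named output per file type;
// the reducer then writes to them via MultipleOutputs.write(name, k, v).
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class Driver {
    public static void configureOutputs(Job job) {
        // Named output names must be alphanumeric; one per record type.
        for (String type : new String[] { "typeA", "typeB" /* ... */ }) {
            MultipleOutputs.addNamedOutput(job, type,
                    SequenceFileOutputFormat.class,
                    Text.class, BytesWritable.class);
        }
    }
}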
Been digging through the APIs, and it looks like CombineFileInputFormat /
CombineFileRecordReader might be the way to go here. It'd let me merge
the whole load of data from each of the (small) files into one M/R job in
an efficient way. Sorting would then occur by key, as would partitioning,
so I'm still good so far.
Problem, however, is when I get to the reducer. The reducer needs to know
which type of file data (i.e., which type of source file) a record came
from, so that it can a) uncompress/deserialize the data correctly, and b)
scatter it out to the correct type of output file.
I'm not entirely clear how to make this happen. It seems like the source
file information (which looks like it might exist on the
CombineFileSplit) is no longer available by the time it gets to the
reducer. And if the reducer doesn't know which file a given record came
from, it won't know how to process it properly.
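What I'm picturing the reducer would need to look like, if the origin
info could somehow be smuggled through (sketch only; assumes a
tab-separated type tag on the key and the new-API MultipleOutputs, with
named outputs registered in the driver as above):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ScatterReducer
        extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    private MultipleOutputs<Text, BytesWritable> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, BytesWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        // Assumes the key arrives as "realKey \t type", tagged upstream.
        String[] parts = key.toString().split("\t");
        Text realKey = new Text(parts[0]);
        String type = parts[1];  // hypothetical tag layout

        for (BytesWritable value : values) {
            // ... uncompress/deserialize according to 'type', apply the
            // fixups, then scatter to the matching named output ...
            out.write(type, realKey, value);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}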
Can anyone lend some suggestions on how to code this solution? Am I on
the right track with the CombineFileInputFormat / CombineFileRecordReader
approach? If so, then how might I make the reducer code aware of the
source of the record(s) it's currently processing?
TIA!
DR