Hi everyone,
I need to process a large number (millions) of relatively small XML
files (mostly 1 KB to 1 MB, I would imagine). Sending each file to its own
map-reduce task would, I suspect, cause too much overhead in task setup and
teardown, so I am considering two alternatives:
1) Generate one large SequenceFile with <K,V> = <filename/URI, file XML content>
for all the files (a rough sketch of the packing step is below). This
SequenceFile would be huge. (I wonder whether there is any maximum record
length, or a maximum size limit on an HDFS file?)
2) Use CombineFileInputFormat (second sketch below).
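
For alternative 1, the packing step I have in mind looks roughly like this.
An untested sketch; the XmlPacker class, the argument layout, and the choice
of BytesWritable for the value are just my assumptions:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlPacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);  // target SequenceFile on HDFS

    // Block compression keeps the packed file small while the
    // SequenceFile itself stays splittable for map-reduce.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      // args[1..] are the local XML files to pack
      for (int i = 1; i < args.length; i++) {
        byte[] xml = Files.readAllBytes(new File(args[i]).toPath());
        writer.append(new Text(args[i]),        // key: filename/URI
                      new BytesWritable(xml));  // value: raw XML bytes
      }
    } finally {
      writer.close();
    }
  }
}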
I would appreciate any comments on the performance considerations and other
pros and cons of these two approaches.
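
For alternative 2, here is the skeleton I am picturing, against the new
(mapreduce) API. Also untested; the WholeFileReader inner class is my own
invention, written to satisfy the (CombineFileSplit, TaskAttemptContext,
Integer) constructor that CombineFileRecordReader expects:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombinedXmlInputFormat
    extends CombineFileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split an individual XML file
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // Runs one WholeFileReader per file packed into the combined split.
    return new CombineFileRecordReader<Text, BytesWritable>(
        (CombineFileSplit) split, context, WholeFileReader.class);
  }

  // Reads the idx-th file of the combined split as one
  // <filename, content> record.
  public static class WholeFileReader
      extends RecordReader<Text, BytesWritable> {
    private final Path path;
    private final long length;
    private final Configuration conf;
    private Text key;
    private BytesWritable value;
    private boolean done = false;

    public WholeFileReader(CombineFileSplit split, TaskAttemptContext ctx,
                           Integer idx) {
      this.path = split.getPath(idx);
      this.length = split.getLength(idx);
      this.conf = ctx.getConfiguration();
    }

    @Override public void initialize(InputSplit s, TaskAttemptContext c) { }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (done) return false;
      byte[] buf = new byte[(int) length];
      FSDataInputStream in = path.getFileSystem(conf).open(path);
      try {
        in.readFully(0, buf);  // whole file fits in memory at these sizes
      } finally {
        in.close();
      }
      key = new Text(path.toString());
      value = new BytesWritable(buf);
      done = true;
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

In the driver I would then set
job.setInputFormatClass(CombinedXmlInputFormat.class) and cap the split size
(e.g. FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024)) so each
map task gets a reasonable bundle of files.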
One more question: if instead I concatenated the component files' XML content
into one large XML file, would it be possible to use StreamXmlRecordReader to
split it back into per-file records before they reach the map tasks? A
configuration sketch of what I mean follows.
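
Something like this, untested; StreamXmlRecordReader ships in the
hadoop-streaming jar and uses the old mapred API, and the <doc> wrapper tag
is just an assumption about how I would concatenate the files:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class ConcatenatedXmlDriver {
  // StreamXmlRecordReader scans the input for begin/end markers and
  // hands each enclosed region to the mapper as one record.
  static void configureXmlInput(JobConf conf) {
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<doc>");  // assumed wrapper tag
    conf.set("stream.recordreader.end", "</doc>");
    conf.setInputFormat(StreamInputFormat.class);
  }
}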
Thanks,
Rajiv