Hi everyone,

I need to process a large number (millions) of relatively small XML files 
(mostly 1 KB to 1 MB each, I would imagine). I suspect that sending each file 
to its own map-reduce task would incur too much overhead in task setup and 
teardown, so I am considering two alternatives:

1) Generate one large SequenceFile with <K,V> = <filename/URI, file XML 
content> covering all the files. This SequenceFile would be huge. (Is there 
any maximum record length, or a maximum size for a single HDFS file?) A rough 
sketch of the writer I am picturing follows below, after the list.

2) Use CombineFileInputFormat to pack many files into each input split (also 
sketched below).
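
For option 1, this is roughly the writer I am picturing (an untested sketch; 
the BytesWritable value type and the BLOCK compression choice are just my 
assumptions):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many small local XML files into one SequenceFile on HDFS:
// key = file name, value = raw file bytes.
public class XmlToSequenceFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);  // output SequenceFile path

    // BLOCK compression groups records together, which seems like a good
    // fit for many small values.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (int i = 1; i < args.length; i++) {
        File f = new File(args[i]);
        byte[] content = readFully(f);
        writer.append(new Text(f.getName()), new BytesWritable(content));
      }
    } finally {
      writer.close();
    }
  }

  // Reads a whole local file into memory (files are at most ~1 MB here).
  private static byte[] readFully(File f) throws IOException {
    byte[] buf = new byte[(int) f.length()];
    DataInputStream in = new DataInputStream(new FileInputStream(f));
    try {
      in.readFully(buf);
    } finally {
      in.close();
    }
    return buf;
  }
}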
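
For option 2, as far as I understand, CombineFileInputFormat is abstract, so 
I would have to subclass it and supply a record reader that emits each 
component file whole. Something like this (untested; WholeFileReader is my 
own hypothetical per-file reader, and I am assuming the old mapred API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small files into each split; every record is one whole file.
public class CombineXmlInputFormat
    extends CombineFileInputFormat<Text, BytesWritable> {

  @SuppressWarnings("unchecked")
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    return new CombineFileRecordReader<Text, BytesWritable>(
        conf, (CombineFileSplit) split, reporter,
        (Class) WholeFileReader.class);
  }

  // Emits the idx-th file of the split as one <path, bytes> record.
  // CombineFileRecordReader constructs it via exactly this signature.
  public static class WholeFileReader
      implements RecordReader<Text, BytesWritable> {

    private final CombineFileSplit split;
    private final Configuration conf;
    private final int idx;
    private boolean done = false;

    public WholeFileReader(CombineFileSplit split, Configuration conf,
                           Reporter reporter, Integer idx) {
      this.split = split;
      this.conf = conf;
      this.idx = idx;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (done) return false;
      Path path = split.getPath(idx);
      byte[] buf = new byte[(int) split.getLength(idx)];
      FSDataInputStream in = path.getFileSystem(conf).open(path);
      try {
        in.readFully(split.getOffset(idx), buf);
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(buf, 0, buf.length);
      done = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return done ? split.getLength(idx) : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() {}
  }
}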

I would appreciate any comments on performance considerations and other pros 
and cons.

One more question: what if instead I generate one large XML file composed of 
(a series of) each file's XML content? Would it be possible to use 
StreamXmlRecordReader to split that combined file back into per-file records 
before they are sent to the map tasks? (A sketch of the configuration I have 
in mind follows.)
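
To make the question concrete, this is the setup I believe would be involved 
(untested; the <doc>...</doc> tags are placeholders for whatever element 
wraps each component file's content in the combined file):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlSplitJob {
  public static JobConf configure() {
    JobConf conf = new JobConf(XmlSplitJob.class);
    // StreamXmlRecordReader scans for the begin/end tags and hands each
    // <doc>...</doc> region to the mapper as one record.
    conf.setInputFormat(StreamInputFormat.class);
    conf.set("stream.recordreader.class",
             "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<doc>");  // placeholder tag
    conf.set("stream.recordreader.end", "</doc>");
    return conf;
  }
}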

Thanks,
Rajiv