We have a similar setup with many smaller files. We found that using a combined file input (roughly 64 MB of files bundled as the input to a single map task), in conjunction with compressed files on HDFS, greatly reduced the per-map overhead. I haven't compared distributing ZIP/TAR bundles on HDFS against the combined file input; I assume the combined input is efficient, but I haven't measured it against explicitly bundling the input files myself.
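
For illustration only, here is a minimal driver sketch of that approach -- not our actual code -- written against the newer mapreduce API with CombineTextInputFormat; the 64 MB cap, the input/output paths, and the identity mapper are placeholders you would replace with your own settings and XML-processing mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine small files");
    job.setJarByClass(CombinedSmallFilesDriver.class);

    // Pack many small files into combined splits of up to ~64 MB, so each
    // map task processes a whole bundle rather than a single tiny file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

    // Identity mapper as a placeholder; a real job would plug in the
    // XML/pattern-matching mapper here instead.
    job.setMapperClass(Mapper.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. input dir of small files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point is just that with a combined input format the split-size cap, rather than the number of individual files, roughly determines how many map tasks you get.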



Andrzej Jan Taramina wrote:
I'm new to Hadoop, so pardon the potentially dumb question....

I've gathered, from much research, that Hadoop is not always a good choice when you need to process a whack of smaller files, which is what we need to do.

More specifically, we need to start by processing about 250K XML files, each of which is in the 50K - 2M range, with an average size of 100K bytes. The processing we need to do on each file is pretty CPU-intensive, with a lot of pattern matching. What we need to do would fall nicely into the Map/Reduce paradigm. Over time, the volume of files will grow by an order of magnitude into the range of millions of files, hence the desire to use a mapred distributed cluster to do the analysis we need.

Normally, one could just concatenate the XML files into bigger input files. Unfortunately, one of our constraints is that a certain percentage of these XML files will change every night, so we need to be able to update the Hadoop data store (HDFS perhaps) on a regular basis. This would be difficult if the files were all concatenated.

The XML data originally comes from a number of XML databases.

Any advice/suggestions on the best way to structure our data storage of all the XML files so that Hadoop would run efficiently, and we could thus use Map/Reduce on a Hadoop cluster, yet still conveniently update the changed files on a nightly basis?

Much appreciated!

