I'm new to Hadoop, so pardon the potentially dumb question.... I've gathered, from much research, that Hadoop is not always a good choice when you need to process a whack of smaller files, which is what we need to do.
More specifically, we need to start by processing about 250K XML files, each between 50 KB and 2 MB, averaging around 100 KB. The processing we need to do on each file is pretty CPU-intensive, with a lot of pattern matching, and it would fall nicely into the Map/Reduce paradigm. Over time, the volume of files will grow by an order of magnitude, into the millions, hence the desire to use a distributed MapReduce cluster to do the analysis we need.

Normally, one could just concatenate the XML files into bigger input files. Unfortunately, one of our constraints is that a certain percentage of these XML files will change every night, so we need to be able to update the Hadoop data store (HDFS perhaps) on a regular basis. That would be difficult if the files were all concatenated. The XML data originally comes from a number of XML databases.

Any advice/suggestions on the best way to structure our storage of all the XML files so that Hadoop runs efficiently and we can use Map/Reduce on a Hadoop cluster, yet still conveniently update the changed files on a nightly basis? Much appreciated!
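To make the "concatenate" idea concrete, this is roughly what I had in mind: packing a batch of the small XML files into a single Hadoop SequenceFile keyed by the original file name, so each batch becomes one large, splittable input. Just a rough sketch; the class name and command-line arguments are made up, and I'm not sure it's the right approach given the nightly update constraint:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackXmlFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical arguments: a local directory of small XML files
            // and an HDFS path for the packed output file.
            File inputDir = new File(args[0]);
            Path output = new Path(args[1]);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, output, Text.class, BytesWritable.class);
            try {
                File[] xmlFiles = inputDir.listFiles();
                if (xmlFiles == null) {
                    throw new IllegalArgumentException("Not a directory: " + inputDir);
                }
                for (File xml : xmlFiles) {
                    byte[] bytes = Files.readAllBytes(xml.toPath());
                    // Key each record by the original file name so individual
                    // documents can still be identified in the mappers.
                    writer.append(new Text(xml.getName()), new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
        }
    }

The thinking was that each night's changed documents could be repacked into their own SequenceFile rather than touching one giant concatenated blob, but I'd welcome better ideas.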
-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com