Hi,

Maybe you should consider using HBase instead of pure HDFS? HDFS is built around large blocks and keeps the metadata for every file in the NameNode's memory, so millions of small files become a significant overhead. HBase runs on top of HDFS and would store many of your documents together in the same underlying files, while still allowing you to modify them selectively.
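If it helps, this is roughly what the storage side would look like with the HBase Java client: one row per XML document, the body in a single cell, and a nightly re-put of the changed documents overwriting only their own rows. A minimal, untested sketch; the table name "xmldocs", the column family "doc" and the row key are placeholders I made up, and the exact client classes vary a bit between HBase releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class XmlStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "xmldocs");

        // One row per document: row key = document id, family "doc", qualifier "xml".
        Put put = new Put(Bytes.toBytes("invoice-12345"));
        put.add(Bytes.toBytes("doc"), Bytes.toBytes("xml"),
                Bytes.toBytes("<invoice>...</invoice>"));
        table.put(put);   // re-putting the same row key later replaces just this document

        // Reading one document back by id.
        Result result = table.get(new Get(Bytes.toBytes("invoice-12345")));
        byte[] xml = result.getValue(Bytes.toBytes("doc"), Bytes.toBytes("xml"));
        System.out.println(Bytes.toString(xml));

        table.close();
    }
}

Your MapReduce job can then scan the whole table (or only the rows changed since the last run) instead of listing millions of individual files in HDFS.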
Regards,
Piotr

2009/9/14 Andrzej Jan Taramina <[email protected]>

> I'm new to Hadoop, so pardon the potentially dumb question....
>
> I've gathered, from much research, that Hadoop is not always a good choice
> when you need to process a whack of smaller files, which is what we need
> to do.
>
> More specifically, we need to start by processing about 250K XML files,
> each of which is in the 50K - 2M range, with an average size of 100K bytes.
> The processing we need to do on each file is pretty CPU-intensive, with a
> lot of pattern matching. What we need to do would fall nicely into the
> Map/Reduce paradigm. Over time, the volume of files will grow by an order
> of magnitude into the range of millions of files, hence the desire to use
> a mapred distributed cluster to do the analysis we need.
>
> Normally, one could just concatenate the XML files into bigger input
> files. Unfortunately, one of our constraints is that a certain percentage
> of these XML files will change every night, and so we need to be able to
> update the Hadoop data store (HDFS perhaps) on a regular basis. This would
> be difficult if the files are all concatenated.
>
> The XML data originally comes from a number of XML databases.
>
> Any advice/suggestions on the best way to structure our data storage of
> all the XML files so that Hadoop would run efficiently and we could thus
> use Map/Reduce on a Hadoop cluster, yet still conveniently update the
> changed files on a nightly basis?
>
> Much appreciated!
>
> --
> Andrzej Taramina
> Chaeron Corporation: Enterprise System Solutions
> http://www.chaeron.com
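On the concatenation point quoted above: the usual trick is not to literally glue the XML together but to pack the small files into a container such as a Hadoop SequenceFile, with the file name as the key and the raw bytes as the value. The quoted drawback still applies, though: updating a changed document means rewriting the container it lives in, which is why I would lean towards HBase here. For completeness, a rough, untested sketch of the packing step (the paths "/local/xml-dump" and "/data/xml/pack-00001.seq" are placeholders):

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackXmlFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/xml/pack-00001.seq");   // placeholder output path on HDFS

        // Key = original file name, value = raw XML bytes.
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (File f : new File("/local/xml-dump").listFiles()) {   // placeholder local dir
                byte[] body = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < body.length) {               // read the whole small file
                        int n = in.read(body, off, body.length - off);
                        if (n < 0) break;
                        off += n;
                    }
                } finally {
                    in.close();
                }
                writer.append(new Text(f.getName()), new BytesWritable(body));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

Mappers then receive (file name, XML bytes) records via SequenceFileInputFormat, so the small-file problem disappears on the read side; only the nightly update remains awkward.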
