On Sep 14, 2009, at 8:44 AM, Piotr Praczyk wrote:

Hi

Maybe you should consider using HBase instead of pure HDFS? HDFS tends to
have a big block size, which would lead to a massive loss of storage space.

Nope, this is not true at all. If you have 64MB blocks and write a 32KB file, the space physically used on disk is ... 32KB.

250K XML files is not insane, and your NN would be able to *easily* handle it (10 million files in HDFS would require about 4GB of RAM in the NN: figure roughly 150 bytes of NN heap per file and block object). However, you've correctly noticed that just because you *can* handle it does not mean it would be *efficient*.

I would suggest looking at Cloudera's blog post about the "small files problem":

http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

The simplest thing you could do is to use the Hadoop ARchive format (HAR) in a pre-processing step. The best thing you could do is to have a pre-processing step based on SequenceFiles (note: either Oozie or Cascading is a great workflow system to help you out).
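
For what it's worth, here is a rough sketch of what that SequenceFile pre-processing step could look like. It assumes the 0.20-era Java API, and the paths, class name, and directory layout are just placeholders, not anything specific to your setup. It packs a local directory of small XML files into one SequenceFile, keyed by the original file name:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a local directory of small XML files into a single SequenceFile on HDFS.
// Usage (illustrative): hadoop jar packer.jar XmlToSequenceFile /local/xml /data/xml.seq
public class XmlToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {
        // Read the whole XML document into memory (they are only 50K-2M each).
        byte[] bytes = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(bytes);
        } finally {
          in.close();
        }
        // Key = original file name, value = raw XML bytes.
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}

A MapReduce job can then read this with SequenceFileInputFormat and get one (file name, XML bytes) record per map call, and the nightly update becomes rewriting the SequenceFiles that contain changed documents instead of touching hundreds of thousands of individual HDFS files.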

When you say "update" nightly, do you mean "add new files" or "update existing files"? If you really mean changing existing files, HBase might be good for you - http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture . HBase natively includes the concept of a timestamp, so you will be able to run against the "latest version" or be able to specify a fixed version (in case you want to repeat a night's analysis).
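
To make the timestamp idea a bit more concrete, here is a rough sketch against the 0.20-era HBase client API; the table name "xml_docs", the "content:xml" column, and the row key are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class XmlDocStore {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "xml_docs");

    // Nightly update: overwrite the changed document. HBase keeps older
    // cells under their own timestamps, up to the column family's VERSIONS limit.
    Put put = new Put(Bytes.toBytes("doc-00042"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("xml"),
            Bytes.toBytes("<record>...</record>"));
    table.put(put);

    // Analysis against the "latest version": a plain Get returns the newest cell.
    Result latest = table.get(new Get(Bytes.toBytes("doc-00042")));
    byte[] latestXml = latest.getValue(Bytes.toBytes("content"), Bytes.toBytes("xml"));

    // Repeating a previous night's analysis: restrict the read to cells
    // written before that night's cutoff timestamp (illustrative value).
    long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
    Get asOf = new Get(Bytes.toBytes("doc-00042"));
    asOf.setTimeRange(0, cutoff);   // half-open range [0, cutoff)
    Result previous = table.get(asOf);
    byte[] previousXml = previous.getValue(Bytes.toBytes("content"), Bytes.toBytes("xml"));
  }
}

For a full analysis run you would hand the same kind of time range to a Scan (fed to the job via TableInputFormat), so the whole MapReduce job sees a consistent snapshot of the documents.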

If you have separate archival storage for your data, you can also consider reloading every night; this might be scalable up to a few TB.

Brian

HBase runs on top of HDFS and would store many files in the same
block while still allowing you to modify them selectively.

regards
Piotr

2009/9/14 Andrzej Jan Taramina <[email protected]>

I'm new to Hadoop, so pardon the potentially dumb question....

I've gathered, from much research, that Hadoop is not always a good choice
when you need to process a whack of smaller
files, which is what we need to do.

More specifically, we need to start by processing about 250K XML files,
each of which is in the 50K - 2M range, with an
average size of 100K bytes. The processing we need to do on each file is
pretty CPU-intensive, with a lot of pattern
matching. What we need to do would fall nicely into the Map/Reduce
paradigm.  Over time, the volume of files will grow
by an order of magnitude into the range of millions of files, hence the
desire to use a mapred distributed cluster to do
the analysis we need.

Normally, one could just concatenate the XML files into bigger input files.
Unfortunately, one of our constraints is
that a certain percentage of these XML files will change every night, and
so we need to be able to update the Hadoop
data store (HDFS perhaps) on a regular basis. This would be difficult if
the files are all concatenated.

The XML data originally comes from a number of XML databases.

Any advice/suggestions on the best way to structure our data storage of all
the XML files so that Hadoop would run
efficiently and we could thus use Map/Reduce on a Hadoop cluster, yet still
conveniently update the changed files on a
nightly basis?

Much appreciated!

--
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com

