On Sep 14, 2009, at 8:44 AM, Piotr Praczyk wrote:
Hi,

Maybe you should consider using HBase instead of pure HDFS? HDFS tends to have a big block size, which would lead to a massive storage space loss.
Nope, this is not true at all. If you have 64MB blocks and write a 32KB file, the space physically used on disk is ... 32KB.
250K XML files is not insane, and your NN would be able to *easily* handle it (10 million files in HDFS would require 4GB of RAM in the NN). However, you've correctly noticed that because you *can* handle it, it does not mean it would be *efficient*.
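(For reference on that memory figure: the namenode keeps every file, directory, and block as an in-memory object at very roughly 150 bytes each, so 10 million small single-block files is on the order of 20 million objects, i.e. roughly 3GB of heap, so 4GB with some headroom.)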
I would suggest looking at Cloudera's blog post about the "small files problem":
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

The simplest thing you could do is to use the Hadoop ARchive format (HAR) in a pre-processing step. The best thing you could do is to have a pre-processing step that packs the small files into SequenceFiles (note: Oozie and Cascading are both great workflow systems to help you out).
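Roughly, the SequenceFile route is a small driver that runs before your nightly job and rolls the XML files into one block-compressed SequenceFile, keyed by filename. Untested sketch, and the paths and class name are made up for illustration:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class PackXmlFiles {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // One big, block-compressed SequenceFile: key = original filename, value = raw XML bytes.
      Path out = new Path("/data/xml-packed/current.seq");                       // made-up path
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, Text.class, BytesWritable.class,
          SequenceFile.CompressionType.BLOCK);
      try {
        for (FileStatus stat : fs.globStatus(new Path("/incoming/xml/*.xml"))) { // made-up path
          byte[] contents = new byte[(int) stat.getLen()];
          FSDataInputStream in = fs.open(stat.getPath());
          try {
            in.readFully(contents);   // files are 50K-2M, so reading each one whole is fine
          } finally {
            in.close();
          }
          writer.append(new Text(stat.getPath().getName()), new BytesWritable(contents));
        }
      } finally {
        writer.close();
      }
    }
  }

Your MapReduce job then reads it with SequenceFileInputFormat, so each map() call sees one whole XML document as a <Text, BytesWritable> record, which fits your per-file pattern matching nicely.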
When you say "update" nightly, do you mean "add new files" or "update existing files"? If you really mean changing existing files, HBase might be good for you - http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture . HBase natively includes the concept of a timestamp, so you will be able to run against the "latest version" or specify a fixed version (in case you want to repeat a night's analysis).
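Not tested, and the table and column family names are invented, but with the 0.20-style client API the nightly update plus a pinned-version read would look roughly like this:

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class XmlDocVersions {
    public static void main(String[] args) throws IOException {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "xml_docs");              // made-up table name

      byte[] family = Bytes.toBytes("content");                 // made-up column family
      byte[] qualifier = Bytes.toBytes("xml");
      byte[] row = Bytes.toBytes("doc-12345");                  // e.g. the document's ID
      byte[] xml = Bytes.toBytes("<record>...</record>");       // placeholder document body

      // Nightly update: write the new version of the document under an explicit timestamp.
      long version = System.currentTimeMillis();
      Put put = new Put(row);
      put.add(family, qualifier, version, xml);
      table.put(put);

      // Later: pin a job to that exact version, e.g. to repeat a night's analysis.
      Get get = new Get(row);
      get.setTimeStamp(version);
      Result result = table.get(get);
      byte[] fetched = result.getValue(family, qualifier);
      System.out.println(Bytes.toString(fetched));
    }
  }

Using the document ID as the row key keeps each XML file individually updatable, and the timestamp gives you the "repeat last night's run" behaviour essentially for free.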
If you have separate archival storage for your data, you can also consider reloading every night; this might be scalable up to a few TB.
Brian
HBase runs on top of HDFS and would store many files in the same block, yet allow you to modify them selectively.

regards
Piotr

2009/9/14 Andrzej Jan Taramina <[email protected]>

I'm new to Hadoop, so pardon the potentially dumb question....

I've gathered, from much research, that Hadoop is not always a good choice when you need to process a whack of smaller files, which is what we need to do. More specifically, we need to start by processing about 250K XML files, each of which is in the 50K - 2M range, with an average size of 100K bytes. The processing we need to do on each file is pretty CPU-intensive, with a lot of pattern matching. What we need to do would fall nicely into the Map/Reduce paradigm. Over time, the volume of files will grow by an order of magnitude into the range of millions of files, hence the desire to use a mapred distributed cluster to do the analysis we need.

Normally, one could just concatenate the XML files into bigger input files. Unfortunately, one of our constraints is that a certain percentage of these XML files will change every night, and so we need to be able to update the Hadoop data store (HDFS perhaps) on a regular basis. This would be difficult if the files are all concatenated. The XML data originally comes from a number of XML databases.

Any advice/suggestions on the best way to structure our data storage of all the XML files so that Hadoop would run efficiently and we could thus use Map/Reduce on a Hadoop cluster, yet still conveniently update the changed files on a nightly basis?

Much appreciated!

--
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com
