Hi,

Maybe you should consider using HBase instead of pure HDFS? HDFS is built around large blocks and keeps the metadata for every file in the NameNode's memory, so millions of small files become a significant overhead. HBase runs on top of HDFS and would store many of your documents together in the same underlying files, while still allowing you to modify them selectively.
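If it helps, this is roughly what the storage side would look like with the HBase Java client: one row per XML document, the body in a single cell, and a nightly re-put of the changed documents overwriting only their own rows. A minimal, untested sketch; the table name "xmldocs", the column family "doc" and the row key are placeholders I made up, and the exact client classes vary a bit between HBase releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class XmlStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "xmldocs");

        // One row per document: row key = document id, family "doc", qualifier "xml".
        Put put = new Put(Bytes.toBytes("invoice-12345"));
        put.add(Bytes.toBytes("doc"), Bytes.toBytes("xml"),
                Bytes.toBytes("<invoice>...</invoice>"));
        table.put(put);   // re-putting the same row key later replaces just this document

        // Reading one document back by id.
        Result result = table.get(new Get(Bytes.toBytes("invoice-12345")));
        byte[] xml = result.getValue(Bytes.toBytes("doc"), Bytes.toBytes("xml"));
        System.out.println(Bytes.toString(xml));

        table.close();
    }
}

Your MapReduce job can then scan the whole table (or only the rows changed since the last run) instead of listing millions of individual files in HDFS.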
Regards,
Piotr

2009/9/14 Andrzej Jan Taramina <[email protected]>

> I'm new to Hadoop, so pardon the potentially dumb question....
>
> I've gathered, from much research, that Hadoop is not always a good choice
> when you need to process a whack of smaller files, which is what we need
> to do.
>
> More specifically, we need to start by processing about 250K XML files,
> each of which is in the 50K - 2M range, with an average size of 100K bytes.
> The processing we need to do on each file is pretty CPU-intensive, with a
> lot of pattern matching. What we need to do would fall nicely into the
> Map/Reduce paradigm. Over time, the volume of files will grow by an order
> of magnitude into the range of millions of files, hence the desire to use
> a mapred distributed cluster to do the analysis we need.
>
> Normally, one could just concatenate the XML files into bigger input
> files. Unfortunately, one of our constraints is that a certain percentage
> of these XML files will change every night, and so we need to be able to
> update the Hadoop data store (HDFS perhaps) on a regular basis. This would
> be difficult if the files are all concatenated.
>
> The XML data originally comes from a number of XML databases.
>
> Any advice/suggestions on the best way to structure our data storage of
> all the XML files so that Hadoop would run efficiently and we could thus
> use Map/Reduce on a Hadoop cluster, yet still conveniently update the
> changed files on a nightly basis?
>
> Much appreciated!
>
> --
> Andrzej Taramina
> Chaeron Corporation: Enterprise System Solutions
> http://www.chaeron.com
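On the concatenation point quoted above: the usual trick is not to literally glue the XML together but to pack the small files into a container such as a Hadoop SequenceFile, with the file name as the key and the raw bytes as the value. The quoted drawback still applies, though: updating a changed document means rewriting the container it lives in, which is why I would lean towards HBase here. For completeness, a rough, untested sketch of the packing step (the paths "/local/xml-dump" and "/data/xml/pack-00001.seq" are placeholders):

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackXmlFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/xml/pack-00001.seq");   // placeholder output path on HDFS

        // Key = original file name, value = raw XML bytes.
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (File f : new File("/local/xml-dump").listFiles()) {   // placeholder local dir
                byte[] body = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < body.length) {               // read the whole small file
                        int n = in.read(body, off, body.length - off);
                        if (n < 0) break;
                        off += n;
                    }
                } finally {
                    in.close();
                }
                writer.append(new Text(f.getName()), new BytesWritable(body));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

Mappers then receive (file name, XML bytes) records via SequenceFileInputFormat, so the small-file problem disappears on the read side; only the nightly update remains awkward.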
