I'm new to Hadoop, so pardon the potentially dumb question.... I've gathered, from much research, that Hadoop is not always a good choice when you need to process a whack of smaller files, which is what we need to do.
More specifically, we need to start by processing about 250K XML files, each between 50 KB and 2 MB, averaging around 100 KB. The processing we need to do on each file is pretty CPU-intensive, with a lot of pattern matching, and it would fall nicely into the Map/Reduce paradigm. Over time, the volume of files will grow by an order of magnitude, into the millions, hence the desire to use a distributed MapReduce cluster to do the analysis we need.

Normally, one could just concatenate the XML files into bigger input files. Unfortunately, one of our constraints is that a certain percentage of these XML files will change every night, so we need to be able to update the Hadoop data store (HDFS perhaps) on a regular basis. That would be difficult if the files were all concatenated. The XML data originally comes from a number of XML databases.

Any advice/suggestions on the best way to structure our storage of all the XML files so that Hadoop runs efficiently and we can use Map/Reduce on a Hadoop cluster, yet still conveniently update the changed files on a nightly basis? Much appreciated!
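To make the "concatenate" idea concrete, this is roughly what I had in mind: packing a batch of the small XML files into a single Hadoop SequenceFile keyed by the original file name, so each batch becomes one large, splittable input. Just a rough sketch; the class name and command-line arguments are made up, and I'm not sure it's the right approach given the nightly update constraint:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackXmlFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical arguments: a local directory of small XML files
            // and an HDFS path for the packed output file.
            File inputDir = new File(args[0]);
            Path output = new Path(args[1]);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, output, Text.class, BytesWritable.class);
            try {
                File[] xmlFiles = inputDir.listFiles();
                if (xmlFiles == null) {
                    throw new IllegalArgumentException("Not a directory: " + inputDir);
                }
                for (File xml : xmlFiles) {
                    byte[] bytes = Files.readAllBytes(xml.toPath());
                    // Key each record by the original file name so individual
                    // documents can still be identified in the mappers.
                    writer.append(new Text(xml.getName()), new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
        }
    }

The thinking was that each night's changed documents could be repacked into their own SequenceFile rather than touching one giant concatenated blob, but I'd welcome better ideas.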
-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com