We have a similar setup with many smaller files. We found that using a
combined file input (roughly 64 MB of files bundled as input to a single
map task), in conjunction with compressed files on HDFS, greatly reduced
the per-task overhead of the map phase. I haven't compared distributing
ZIP/TAR files on HDFS versus using the combined file input ... I assume
the combined file input is efficient, but I have not measured it against
explicitly bundling the input files.
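
For what it's worth, the driver side of that looks roughly like the sketch
below. This assumes a newer release with the org.apache.hadoop.mapreduce API
and CombineTextInputFormat (older releases may only have the abstract
CombineFileInputFormat or MultiFileInputFormat); the class name and the
identity mapper/reducer are just placeholders for the real per-file logic.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into ~64 MB input splits so each map task
        // processes a bundle of files rather than one tiny file apiece.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // Identity mapper/reducer as placeholders for the real processing.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compressed input should work the same way in our experience, as long as each
file is small enough to be read whole by a single record reader, but treat
that as something to verify on your own version rather than a given.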
Andrzej Jan Taramina wrote:
I'm new to Hadoop, so pardon the potentially dumb question....
I've gathered, from much research, that Hadoop is not always a good choice
when you need to process a whack of smaller files, which is what we need to do.

More specifically, we need to start by processing about 250K XML files, each
of which is in the 50K - 2M range, with an average size of 100K bytes. The
processing we need to do on each file is pretty CPU-intensive, with a lot of
pattern matching. What we need to do would fall nicely into the Map/Reduce
paradigm. Over time, the volume of files will grow by an order of magnitude
into the range of millions of files, hence the desire to use a mapred
distributed cluster to do the analysis we need.
Normally, one could just concatenate the XML files into bigger input files.
Unfortunately, one of our constraints is that a certain percentage of these
XML files will change every night, and so we need to be able to update the
Hadoop data store (HDFS perhaps) on a regular basis. This would be difficult
if the files were all concatenated.
The XML data originally comes from a number of XML databases.
Any advice/suggestions on the best way to structure our data storage of all
the XML files so that Hadoop would run efficiently and we could thus use
Map/Reduce on a Hadoop cluster, yet still conveniently update the changed
files on a nightly basis?
Much appreciated!