If you take your situation to its logical extreme, you end up with something quite close to what we have.
At the far end of this logic, you get huge numbers of files to process. We get more than 100,000 per hour. It quickly becomes apparent that it is impossible to process this many relatively small files efficiently. At the very least, your disk throughput drops dramatically.

The simple solution is to package files together in such a way that you can start anywhere in the package and process a number of files (a rough sketch of the packing step is at the end of this message). Using this technique, you can pretty easily have a much smaller number of very large files. Moreover, because of the start-anywhere design, these large files can be processed efficiently in Hadoop because they will be near the programs processing them and because they will be read from disk in large sequential swathes.

If you have a very small problem that doesn't stress your machines much, then this may not matter. But then again, if you have a small problem like that, then copying your files into HDFS isn't a problem either. If you have a large problem, then consolidating your files and storing them in HDFS will be a win anyway.

On 8/26/07 8:22 AM, "mfc" <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> Can Hadoop run Map/Reduce directly on files in a local file system and
> would this make sense?
>
> Seems like there is a tradeoff to be made when you have to process lots
> and lots of little files. The tradeoff is the average size of the files.
> If they are small (under 10k in size) and there are thousands of them,
> would it make sense to process the files directly from the local file
> system via Map/Reduce?
>
> Is there a mode in Hadoop to do this? Does Hadoop make sense to use in
> this case?
>
> Thanks
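
For concreteness, here is a rough sketch of the packing step. The thread above doesn't name a particular container format, so this just assumes Hadoop's SequenceFile (Text key = original file name, BytesWritable value = file contents) as one reasonable choice; the class name, paths, and argument layout are made up for illustration.

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical packer: reads every file in a local directory and appends it
// as one record of a single SequenceFile in HDFS.
public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // e.g. a packed file like batch-0001.seq

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {  // local dir of small files
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          // read the whole file; read() may return short counts, so loop
          int off = 0;
          while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
          }
        } finally {
          in.close();
        }
        // key = original file name, value = raw bytes of the file
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

SequenceFiles carry periodic sync markers, so a map task handed a split of the packed file can seek forward to the next record boundary and begin there; that is the "start anywhere" property described above.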
