If you take your situation to its logical extreme, you end up with something quite close to what we have.
At the far end of this logic, you get huge numbers of files to process. We get more than 100,000 per hour. It quickly becomes apparent that it is impossible to process this many relatively small files efficiently. At the very least, your disk throughput drops dramatically.

The simple solution is to package files together in such a way that you can start anywhere in the package and process a number of files (a rough sketch of the packing step is at the end of this message). Using this technique, you can pretty easily have a much smaller number of very large files. Moreover, because of the start-anywhere design, these large files can be processed efficiently in Hadoop because they will be near the programs processing them and because they will be read from disk in large sequential swathes.

If you have a very small problem that doesn't stress your machines much, then this may not matter. But then again, if you have a small problem like that, then copying your files into HDFS isn't a problem either. If you have a large problem, then consolidating your files and storing them in HDFS will be a win anyway.

On 8/26/07 8:22 AM, "mfc" <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> Can Hadoop run Map/Reduce directly on files in a local file system and
> would this make sense?
>
> Seems like there is a tradeoff to be made when you have to process lots
> and lots of little files. The tradeoff is the average size of the files.
> If they are small (under 10k in size) and there are thousands of them,
> would it make sense to process the files directly from the local file
> system via Map/Reduce?
>
> Is there a mode in Hadoop to do this? Does Hadoop make sense to use in
> this case?
>
> Thanks
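
For concreteness, here is a rough sketch of the packing step. The thread above doesn't name a particular container format, so this just assumes Hadoop's SequenceFile (Text key = original file name, BytesWritable value = file contents) as one reasonable choice; the class name, paths, and argument layout are made up for illustration.

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical packer: reads every file in a local directory and appends it
// as one record of a single SequenceFile in HDFS.
public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // e.g. a packed file like batch-0001.seq

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {  // local dir of small files
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          // read the whole file; read() may return short counts, so loop
          int off = 0;
          while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
          }
        } finally {
          in.close();
        }
        // key = original file name, value = raw bytes of the file
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

SequenceFiles carry periodic sync markers, so a map task handed a split of the packed file can seek forward to the next record boundary and begin there; that is the "start anywhere" property described above.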
