This begins to sound like you are trying to do something that is not a very nice match to Hadoop's map-reduce framework at all. It may be that HDFS would be very helpful to you, but map-reduce may not be so much help.
Here are a few questions about your application that will help determine whether it is a good map-reduce candidate:

A) First and most importantly, is your program batch oriented, or is it supposed to respond quickly to random requests? If it is batch oriented, then it is likely that map-reduce will help. If it is intended to respond to random requests, then it is unlikely to be a match.

B) Do you intend to have a very large number of files (large is more than 1 million files, very large is more than 10 million), or are your files very small (small is less than 10MB or so, very small is less than 1MB)? If you need a very large number of files, you need to redesign your problem or look for a different file store. If you are working with very small files, even ones that fit comfortably into memory, then you may need to concatenate them together to get larger files.

C) How long a program startup time can you allow? Hadoop's map-reduce is oriented mostly around the batch processing of very large data sets, which means that a fairly lengthy startup time is acceptable and even desirable if it allows faster overall throughput. If you can't stand a startup time of 10 seconds or so, then you need a non-map-reduce design.

D) If you need real-time queries, can you use HBase?

Based on what you have said so far, it sounds like you either have batch oriented input for relatively small batches of inputs (less than 100,000 or so) or you have a real-time query requirement. In either case, you may need to have a program that runs semi-permanently. If you have such a program, then keeping track of what data is already in memory is pretty easy, and using HDFS as a file store could be really good for your application.

One way to get such a long-lived program is to simply use HBase (if you can). If that doesn't work for you, you might try using the map-file structure or Lucene to implement your own long-running distributed search system.
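To make the long-lived program idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under some assumptions: local files stand in for files you would actually read out of HDFS, nothing here is a Hadoop or HBase API, and the names InMemorySearcher and demo are made up for this example. The point is just that a process which stays up can load each file once and always knows exactly which files it is holding in memory.

```python
import tempfile
from pathlib import Path


class InMemorySearcher:
    """Hypothetical long-lived helper: loads each file once and remembers it."""

    def __init__(self):
        # Maps path -> file contents. The keys are exactly the files in memory.
        self._cache = {}

    def load(self, path):
        """Read a file on first use; later calls reuse the in-memory copy."""
        key = str(path)
        if key not in self._cache:
            # Assumption: a local read stands in for a read from HDFS.
            self._cache[key] = Path(path).read_text(encoding="utf-8")
        return self._cache[key]

    def files_in_memory(self):
        """Answer 'which files are in memory?' directly from the cache keys."""
        return sorted(self._cache)

    def search(self, paths, needle):
        """Return the paths whose contents contain the search string."""
        return [str(p) for p in paths if needle in self.load(p)]


def demo():
    """Build two small files, search them, and report what is cached."""
    with tempfile.TemporaryDirectory() as tmp:
        a = Path(tmp) / "a.txt"
        b = Path(tmp) / "b.txt"
        a.write_text("hadoop map-reduce", encoding="utf-8")
        b.write_text("hbase and lucene", encoding="utf-8")

        searcher = InMemorySearcher()
        hits = searcher.search([a, b], "lucene")
        return [Path(h).name for h in hits], len(searcher.files_in_memory())
```

In a real deployment you would run one such process per node over the files stored locally and fan each query out to all of them, which is roughly what a map-file or Lucene based search system would give you with far better indexing than a plain substring scan.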
If you can be more specific about what you are trying to do, you are likely to get better answers.

On 2/10/08 10:05 PM, "Shimi K" <[EMAIL PROTECTED]> wrote:

> I choose Hadoop more for the distributed calculation then the support for
> huge files and my files do fit into memory.
> I have a lot of small files and my system needs to search for something in
> those files very fast. I figured I can distribute the files on a Hadoop
> cluster and then uses the distributed calculation to do the search in
> parallel on many files as possible. This way I would be able to return a
> result faster then if I would have used one machine.
>
> Is there a way to tell which files are in memory?
>
> On Feb 10, 2008 10:33 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> But if your files DO fit into memory then the datanodes that have copies
>> of the blocks of your file will probably still have them in memory and
>> since maps are typically data local, you will benefit as much as possible.
>>
>> On 2/10/08 11:17 AM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
>>
>>>> Is Hadoop cache frequently/LRU/MRU map input files? Or does it
>>>> upload files from the disk each time a file is needed no matter if
>>>> it was the same file that was required by the last job on the same
>>>> node?
>>>
>>> There is no concept of caching input files across jobs.
>>>
>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>> don't fit into memory anyway... and hence doing it across jobs is moot.
