Are you trying to do ad hoc real-time queries or batches of queries?
On 2/11/08 12:03 AM, "Shimi K" <[EMAIL PROTECTED]> wrote:

> You misunderstood me. I do not want maps across different nodes to share
> their cache. I am not looking for data replication across nodes. Right now
> the amount of data I have on those files can fit into RAM. I want to use
> Hadoop to search those files in parallel. This way I will reduce the amount
> of time it will take me to search all the files on one machine. Even if the
> amount of data will grow beyond the RAM of a single machine, I have no
> problem adding additional machines to the cluster in order to make the search
> faster.
>
> 90% of my jobs will do a search on all the files in the cluster! I want to
> make sure that each node will not waste time uploading (from the disk) the
> same file which it already uploaded for the previous job.
>
> Shimi
>
>
> On Feb 11, 2008 9:10 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>> I totally missed what you wanted to convey. What you want is that the
>> maps (the tasks) should be able to share their caches across jobs. In
>> Hadoop each task is a separate JVM. So sharing caches across tasks is
>> sharing across JVMs, and that too over time (i.e. to make the cache a
>> separate higher-level entity), which I think would not be possible.
>> What you can do is to increase the file size, i.e. before uploading to the
>> DFS, concatenate the files. I don't think that will affect your algorithm
>> in any sense. Wherever you need to maintain the file boundary you could
>> do something like
>> concatenated file :
>> filename1 : data1
>> filename2 : data2
>> ..
>> And the map will take care of such hidden structures.
>> So the only way I think is to increase the file size through concatenation.
>> Amar
>> Shimi K wrote:
>>> I chose Hadoop more for the distributed calculation than the support for
>>> huge files, and my files do fit into memory.
>>> I have a lot of small files and my system needs to search for something in
>>> those files very fast. I figured I can distribute the files on a Hadoop
>>> cluster and then use the distributed calculation to do the search in
>>> parallel on as many files as possible. This way I would be able to return a
>>> result faster than if I would have used one machine.
>>>
>>> Is there a way to tell which files are in memory?
>>>
>>>
>>> On Feb 10, 2008 10:33 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>> But if your files DO fit into memory then the datanodes that have copies
>>>> of the blocks of your file will probably still have them in memory, and
>>>> since maps are typically data local, you will benefit as much as possible.
>>>>
>>>>
>>>> On 2/10/08 11:17 AM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>>> Does Hadoop cache frequently/LRU/MRU map input files? Or does it
>>>>>> upload files from the disk each time a file is needed no matter if it
>>>>>> was the same file that was required by the last job on the same node?
>>>>>>
>>>>> There is no concept of caching input files across jobs.
>>>>>
>>>>> Hadoop is geared towards dealing with _huge_ amounts of data which
>>>>> don't fit into memory anyway... and hence doing it across jobs is moot.
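
For reference, the "filename : data" layout Amar describes in the quoted thread can be sketched roughly as below. This is only an illustration in plain Python, not Hadoop code: the names `concatenate`, `search_map`, and `SEP` are made up for the example, the small files are stand-in strings, and it assumes one record per line and that no file name contains the separator.

```python
SEP = " : "  # hypothetical separator between file name and data, following the quoted layout


def concatenate(files):
    """Yield one 'filename : line' record per input line.

    `files` maps a file name to its text content; it stands in for reading
    many small files before uploading one large file to the DFS.
    """
    for name, text in files.items():
        for line in text.splitlines():
            yield f"{name}{SEP}{line}"


def search_map(record, needle):
    """Map-style step: recover the file boundary from the record, then search the data."""
    # Split on the first separator only, so the data itself may contain " : ".
    name, _, data = record.partition(SEP)
    if needle in data:
        return (name, data)
    return None


# Stand-in data for the many small files (assumed content, for illustration only).
small_files = {
    "a.txt": "alpha\nbeta",
    "b.txt": "gamma\nalpha beta",
}

records = list(concatenate(small_files))
hits = [h for h in (search_map(r, "alpha") for r in records) if h is not None]
print(hits)
```

Because each record carries its own file name, a map task can report which original file a match came from even though the input is one large concatenated file.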
