Re: Caching frequently map input files

Amar Kamat Sun, 10 Feb 2008 23:08:55 -0800

Hi,

I totally missed what you wanted to convey. What you want is that the maps (the tasks) should be able to share their caches across jobs. In Hadoop each task is a separate JVM. So sharing caches across tasks means sharing across JVMs, and over time as well (i.e. making the cache a separate higher-level entity), which I think would not be possible. What you can do is increase the file size, i.e. concatenate the files before uploading them to the DFS. I don't think that will affect your algorithm in any way. Wherever you need to maintain the file boundary you could do something like

concatenated file :
filename1 : data1
filename2 : data2
..
And the map will take care of such hidden structures.
So the only way I think is to increase the filesize through concatenation.
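
A minimal sketch of the concatenated layout above, in Python for brevity. It assumes one record per line, a " : " separator that never appears in a filename, and backslash-escaping of newlines so that each small file fits on a single line. The function names (escape, unescape, concatenate, decode_record, search_map) are hypothetical and not part of Hadoop; the map function here is only an illustration of how a task could recover the file boundary.

```python
SEPARATOR = " : "  # assumed delimiter; filenames must not contain it


def escape(data):
    """Fold a file's contents onto one line by escaping backslashes and newlines."""
    return data.replace("\\", "\\\\").replace("\n", "\\n")


def unescape(text):
    """Reverse escape() by scanning for backslash pairs."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            out.append("\n" if nxt == "n" else nxt)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def concatenate(files):
    """Build the concatenated file body from a {filename: contents} mapping."""
    return "\n".join(name + SEPARATOR + escape(data) for name, data in files.items())


def decode_record(line):
    """Split one line back into (filename, original contents)."""
    name, _, escaped = line.partition(SEPARATOR)
    return name, unescape(escaped)


def search_map(line, needle):
    """Map step: yield the filename if its contents contain the search term."""
    name, data = decode_record(line)
    if needle in data:
        yield name
```

With a line-oriented input format, each map call would then see one original file per input line and can report matches by filename.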
Amar
Shimi K wrote:

I chose Hadoop more for the distributed calculation than for the support for
huge files, and my files do fit into memory.
I have a lot of small files and my system needs to search for something in
those files very fast. I figured I can distribute the files on a Hadoop
cluster and then use the distributed calculation to do the search in
parallel on as many files as possible. This way I would be able to return a
result faster than if I had used one machine.


Is there a way to tell which files are in memory?


On Feb 10, 2008 10:33 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

But if your files DO fit into memory then the datanodes that have copies of
the blocks of your file will probably still have them in memory and since
maps are typically data local, you will benefit as much as possible.


On 2/10/08 11:17 AM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:

Does Hadoop cache frequently used/LRU/MRU map input files? Or does it load
files from the disk each time a file is needed, no matter if it was the same
file that was required by the last job on the same node?

There is no concept of caching input files across jobs.

Hadoop is geared towards dealing with _huge_ amounts of data which
don't fit into memory anyway... and hence doing it across jobs is moot.
