Re: Caching frequently map input files

Amar Kamat Sun, 10 Feb 2008 22:39:40 -0800

Shimi K wrote:

Does Hadoop cache frequently used (LRU/MRU) map input files? Or does it load
files from disk each time a file is needed, even if the same file was
required by the last job on the same node?

Hadoop uses data locality for scheduling the tasks. Most of the hits we get come from data locality. For the rest we simply transfer the file over the network.

If the data is huge then, as Arun mentioned, there will be no caching. What could be done is to make the copy local and somehow inform the namenode. But I guess the complexity involved in doing this handshake, and its effect on re-balancing, would be too high. Also, this amount of replication would lead to a lot of redundancy and hence wasted space. That might be one reason this is not done.

If the data size is small then it could be cached, but the problem is that keeping this kind of state information adds a lot of complexity compared to the current state, where the jobs are fairly independent and hence easy to handle. Also, it's better to copy these files over the network (since the size is small and the chance of this happening is also low). Again, the question is how much of the data would you keep in cache, what policy would you use to free the cache, and how would you inform the JT about cache deletions? It's good to keep it simple.
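
To make the locality point concrete, here is a rough sketch of the idea in Python. It is not Hadoop's scheduler code; the function name pick_task, the task ids and the node names are all made up for illustration. It only shows the preference described above: hand a node a task whose input already has a replica there, and otherwise fall back to a task whose input has to come over the network.

```python
from typing import Dict, Optional, Set, Tuple


def pick_task(node: str, pending: Dict[str, Set[str]]) -> Optional[Tuple[str, bool]]:
    """Pick a pending map task for `node` (illustrative only).

    `pending` maps a hypothetical task id to the set of nodes that hold
    a replica of that task's input. Returns (task_id, is_local), or None
    when nothing is pending.
    """
    # First choice: a task whose input already lives on this node.
    for task_id in sorted(pending):
        if node in pending[task_id]:
            return task_id, True
    # No local input: take any task and read its input over the network.
    for task_id in sorted(pending):
        return task_id, False
    return None


# Hypothetical cluster state: task id -> nodes holding the input replicas.
pending = {"map_0": {"nodeA", "nodeB"}, "map_1": {"nodeC"}}
print(pick_task("nodeC", pending))  # ('map_1', True)
print(pick_task("nodeD", pending))  # ('map_0', False)
```

Note that nothing here remembers what a node read for an earlier job. Each decision uses only where the replicas currently are, which is the "jobs are fairly independent" property mentioned above.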

Amar

I am currently using version 0.14.4


- Shimi
