Shimi K wrote:
Does Hadoop cache frequently used/LRU/MRU map input files? Or does it load files
from the disk each time a file is needed, no matter if it was the same file
that was required by the last job on the same node?
Hadoop uses data locality when scheduling tasks, so most of the hits we get come from data locality. For the rest, we simply transfer the file over the network.

If the data is huge then, as Arun mentioned, there will be no caching. What could be done is to keep the copy local and somehow inform the namenode. But I guess the complexity of this handshake, and its effect on rebalancing, would be too high. This extra replication would also lead to a lot of redundancy and hence wasted space. That might be one reason it is not done.

If the data is small then it could be cached, but keeping this kind of state information adds a lot of complexity compared to the current design, where jobs are fairly independent and hence easy to handle. It is also better to just copy these files over the network, since they are small and the chance of this happening is low. Then again, the question is how much data you would keep in the cache, what policy you would use to free the cache, and how you would inform the JT about cache deletions. It's good to keep it simple.
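To make the bookkeeping concrete, here is a rough sketch of what a node-local LRU file cache might look like. This is purely hypothetical and is not Hadoop code: the class name LocalFileCacheSketch, the access method and the evictionReports list are all made up for illustration, and the list only stands in for whatever message would have to reach the JT on each deletion.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Hadoop.
class LocalFileCacheSketch {
    private final long capacityBytes;
    private long usedBytes = 0;

    // accessOrder=true keeps entries ordered from least to most recently used.
    private final LinkedHashMap<String, Long> files =
            new LinkedHashMap<>(16, 0.75f, true);

    // Stand-in for the notifications that would have to be sent to the JT.
    private final List<String> evictionReports = new ArrayList<>();

    LocalFileCacheSketch(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Returns true on a cache hit; on a miss, caches the file and evicts as needed. */
    boolean access(String path, long sizeBytes) {
        if (files.get(path) != null) {
            return true;                 // get() also marks the entry as recently used
        }
        if (sizeBytes > capacityBytes) {
            return false;                // too large to cache: always read remotely
        }
        files.put(path, sizeBytes);
        usedBytes += sizeBytes;
        evictUntilWithinCapacity();
        return false;
    }

    // Drops least recently used files until the cache fits, recording each deletion.
    private void evictUntilWithinCapacity() {
        Iterator<Map.Entry<String, Long>> it = files.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Map.Entry<String, Long> eldest = it.next();
            usedBytes -= eldest.getValue();
            evictionReports.add(eldest.getKey());
            it.remove();
        }
    }

    List<String> evictionReports() {
        return evictionReports;
    }

    public static void main(String[] args) {
        LocalFileCacheSketch cache = new LocalFileCacheSketch(100);
        cache.access("a", 40);
        cache.access("b", 40);
        cache.access("a", 40);           // hit, so "a" becomes the most recently used
        cache.access("c", 40);           // over capacity, so the eldest entry is evicted
        System.out.println(cache.evictionReports());
    }
}
```

Even this toy version has to track sizes, recency and a deletion log per node, and every entry in that log is a message the scheduler would need to receive and act on before it could trust the cache when placing tasks.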
Amar
I am currently using version 0.14.4

- Shimi

