Shimi K wrote:
Does Hadoop cache frequently used map input files (LRU/MRU)? Or does it
load files from the disk each time a file is needed, even if it was the
same file that was required by the last job on the same node?
Hadoop uses data locality for scheduling the tasks. Most of the hits we
get come through data locality. For the rest we simply transfer the file
over the network. If the data is huge then, as Arun mentioned, there will
be no caching. What could be done is to keep the copy local and somehow
inform the namenode. But I guess the complexity involved in doing this
handshake, and its effect on re-balancing, would be high. Also, this
amount of replication would lead to a lot of redundancy and hence wasted
space. That might be one reason this is not done. If the data size is
small then it could be cached, but the problem is that keeping this kind
of state information adds a lot of complexity compared to the current
state, where the jobs are fairly independent and hence easy to handle.
Also, it's better to just copy these files over the network (since the
size is small and the chance of this happening is also low). Again, the
question is how much of the data would you keep in cache, what policy
would you use to free the cache, and how would you inform the JobTracker
(JT) about cache deletions? It's good to keep it simple.
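
To make those three questions concrete, here is a minimal sketch in
Python of what such a per-node cache might look like. This is not
something Hadoop does, and every name in it (LocalInputCache,
capacity_bytes, on_evict) is hypothetical. It assumes a byte budget for
"how much", LRU for "what policy", and a plain callback standing in for
"inform the JT".

```python
from collections import OrderedDict


class LocalInputCache:
    """Hypothetical per-node LRU cache of map input files, bounded by bytes."""

    def __init__(self, capacity_bytes, on_evict):
        self.capacity_bytes = capacity_bytes
        self.on_evict = on_evict      # stand-in for telling the JT about a deletion
        self.used_bytes = 0
        self.entries = OrderedDict()  # path -> size in bytes, least recent first

    def get(self, path):
        """Return True on a hit and mark the file as most recently used."""
        if path not in self.entries:
            return False
        self.entries.move_to_end(path)
        return True

    def put(self, path, size_bytes):
        """Cache a file, evicting least recently used files until it fits."""
        if size_bytes > self.capacity_bytes:
            return False              # too large to cache at all
        if path in self.entries:
            # Re-inserting an existing file: release its old accounting first.
            self.used_bytes -= self.entries.pop(path)
        while self.used_bytes + size_bytes > self.capacity_bytes:
            victim, victim_size = self.entries.popitem(last=False)
            self.used_bytes -= victim_size
            self.on_evict(victim)
        self.entries[path] = size_bytes
        self.used_bytes += size_bytes
        return True


evicted = []
cache = LocalInputCache(capacity_bytes=100, on_evict=evicted.append)
cache.put("/input/a", 40)
cache.put("/input/b", 40)
cache.get("/input/a")        # touch a, so b becomes the eviction candidate
cache.put("/input/c", 40)    # does not fit alongside a and b, so b is evicted
```

Even in this toy version, every eviction needs a message to whoever
schedules tasks, otherwise a task could be placed on a node for a file
that is no longer there. That bookkeeping is the extra state referred
to above.
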
Amar
I am currently using version 0.14.4
- Shimi