Bryan A. P. Pendleton wrote:
Would the "replication" parameter be sufficient for you? It lets you push the system to make a copy of each block in a file on a larger set of nodes, possibly equal to the number of nodes in your cluster. Of course, this saves no space over local copying, but it does mean that you won't have to do the copy manually, and local access should be faster.
Just use "hadoop dfs -setrep -R # /path/to/criticalfiles", where # is your cluster size. This assumes you're running a DataNode on each node that you want the copies made to (and, well, that the nodes doing lookups are the same nodes running DataNodes, or else you'll end up with extra copies).
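For example, on a 3-node cluster with the critical data under /user/nutch/segments (path and numbers purely illustrative), that would be:

  hadoop dfs -setrep -R 3 /user/nutch/segments

The -R flag applies the new replication factor recursively to every file under that path, and the NameNode then re-replicates the existing blocks in the background.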
No, I don't think this would help ... I don't want to replicate each segment to all nodes; I can't afford it, since this would quickly exhaust the total capacity of the cluster. If I set the replication factor lower than the size of the cluster, then again I have no guarantee that whole files are present locally.
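To put rough numbers on it (purely illustrative): with 10 segments of 100 GB each on a 10-node cluster, replication equal to the cluster size means 10 x 1 TB = 10 TB of raw disk for 1 TB of data, versus 3 TB at the default replication factor of 3.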
Let's say I have 3 segments, and I want to run 3 map tasks, each with
its own segment data. The idea is that I want to make sure that task1
executing on node1 will have all blocks from segment1 on the local disk
of node1; and the same for task2, task3 and so on.
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com