You might try setting the block size for these files to be "very large". This should guarantee that the entire file ends up on one node.
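
For instance (untested, and the path and sizes below are only placeholders), the five-argument FileSystem.create() call lets you pick a block size for a single file without touching the cluster-wide dfs.block.size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch: write one index file with an oversized block size so that
// all of its data ends up in a single block on a single datanode.
public class BigBlockWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path dst = new Path("/indexes/segment1/index.tar");   // made-up path
    long blockSize = 2L * 1024 * 1024 * 1024;             // bigger than the file itself

    FSDataOutputStream out = fs.create(
        dst,
        true,                                              // overwrite
        conf.getInt("io.file.buffer.size", 4096),          // buffer size
        fs.getDefaultReplication(),                        // keep normal replication
        blockSize);
    // ... write the index (or tar) bytes here ...
    out.close();
  }
}

As long as the block size is larger than the file, the file is stored as a single block, so every replica of that block is a complete copy of the file sitting on one datanode.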

If an index is composed of many files, you could "tar" them together so each index is exactly one file.

Might work... Of course, as indexes get really large this approach has side effects: each index becomes a single huge block that must fit on one datanode's disk and gets copied around as one unit.


On Sep 25, 2006, at 2:32 PM, Andrzej Bialecki wrote:

Bryan A. P. Pendleton wrote:
Would the "replication" parameter be sufficient for you? This will allow you to push the system to make a copy of each block in a file on a higher set of nodes, possibly equal to the number of nodes in your cluster. Of course, this saves no space over local copying, but it does mean that you won't have
to do the copy manually, and local-access should be sped up.

Just use "hadoop dfs -setrep -R # /path/to/criticalfiles" where # = your cluster size. This assumes you're running a DataNode on each node that you want the copies made to (and, well, that the nodes doing lookups == the
nodes running datanodes, or else you'll end up with extra copies).
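
For example, on a hypothetical 20-node cluster with the critical files under /crawl/indexes (both made up), that would be:

hadoop dfs -setrep -R 20 /crawl/indexes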

No, I don't think this would help ... I don't want to replicate each segment to all nodes; I can't afford it, since this would quickly exhaust the total capacity of the cluster. If I set the replication factor lower than the size of the cluster, then again I have no guarantee that whole files are present locally.

Let's say I have 3 segments, and I want to run 3 map tasks, each with its own segment data. The idea is that I want to make sure that task1 executing on node1 will have all blocks from segment1 on the local disk of node1; and the same for task2, task3 and so on.
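
As a rough, untested sketch (the path is made up, and I'm assuming a FileSystem API that exposes per-block host locations, e.g. getFileBlockLocations), this is the property I'd like to be able to check for each task:

import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Check whether every block of a segment file has a replica on this node.
public class LocalityCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    String localHost = InetAddress.getLocalHost().getHostName();

    Path segmentFile = new Path("/crawl/segments/segment1/index");  // made-up path
    FileStatus status = fs.getFileStatus(segmentFile);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    boolean allLocal = true;
    for (BlockLocation block : blocks) {
      boolean hasLocalReplica = false;
      for (String host : block.getHosts()) {
        if (host.equals(localHost)) {
          hasLocalReplica = true;
          break;
        }
      }
      allLocal &= hasLocalReplica;
    }
    System.out.println(segmentFile + (allLocal ? ": all blocks local" : ": some blocks remote"));
  }
}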

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


