Would the "replication" parameter be sufficient for you? This will allow you
to push the system to make a copy of each block in a file on a higher set of
nodes, possibly equal to the number of nodes in your cluster. Of course,
this saves no space over local copying, but it does mean that you won't have
to do the copy manually, and local-access should be sped up.

Just use "hadoop dfs -setrep -R # /path/to/criticalfiles" where # = your
cluster size. This assumes you're running a DataNode on each node that you
want the copies made to (and, well, that the nodes doing lookups == the
nodes running datanodes, or else you'll end up with extra copies).
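
For example, on a (hypothetical) 20-node cluster with a default replication
of 3, you could raise the segments to full replication while they're being
searched and drop them back afterwards; the path here is just a placeholder
for your own layout:

  # replicate every block of the segments onto all 20 nodes
  hadoop dfs -setrep -R 20 /user/nutch/crawl/segments

  # when done, fall back to the configured default
  hadoop dfs -setrep -R 3 /user/nutch/crawl/segments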

On 9/25/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Hi,

I'm investigating how to implement a map-reduce based searching in
Nutch. Let me describe my current plan regarding this, and why I need to
"localize" the data blocks.

A mapred search job would basically run map() in a never-ending loop,
serving the queries. Each node would get its own group of segments, so that
the document collection is spread more or less evenly across the mapred
nodes. Nutch segments are a bunch of MapFile data plus corresponding Lucene
indexes, and this data consists of relatively few, very large files.
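
Very roughly, the shape I have in mind for such a map task is sketched
below; SegmentSearcher and QueryChannel are just placeholders for whatever
the real pieces would be, not existing Nutch or Hadoop classes:

  // Hypothetical sketch: a map task body that never returns, and instead
  // keeps answering search requests against the segments assigned to it.
  import java.io.IOException;

  public class SearchMapLoop {

      /** Stand-in for a searcher opened over this node's local segment data. */
      interface SegmentSearcher {
          String search(String query) throws IOException;
      }

      /** Stand-in for however queries reach the task (RPC, socket, queue ...). */
      interface QueryChannel {
          String nextQuery() throws IOException;  // blocks until a query arrives
          void reply(String hits) throws IOException;
      }

      /** The "map()" of the search job: loop forever, serving queries. */
      static void serve(SegmentSearcher searcher, QueryChannel channel)
              throws IOException {
          while (true) {                          // never-ending map task
              String query = channel.nextQuery();
              channel.reply(searcher.search(query));
          }
      }
  }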

Experiments show that using this data directly from DFS is way too slow, so
at the moment it always has to be copied from DFS to local disks. This is a
very expensive step: it uses up valuable (and limited) local disk space,
and it has to be performed manually, which is even more expensive and
error-prone.
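
For the record, the manual step is roughly the following, repeated on every
search node for every segment it should serve (paths are only illustrative):

  # copy one segment out of DFS onto this node's local disk
  hadoop dfs -get /user/nutch/crawl/segments/20060925 /data/local/segments/20060925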

I'm curious whether there is a way to avoid this copying when porting this
code to run as a mapred job - that is, a way to tell DFS to locate all
blocks from such files and, if necessary, over-replicate them, so that for
any given node requesting this sort of access to a specific file, all
blocks of that file are always found locally. This would hold until a
"de-localize" request is made, at which point DFS would go back to the
normal replication policy (and delete the spurious blocks).
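
To make the idea concrete, the kind of call I'm imagining looks something
like the interface below - purely hypothetical, nothing like it exists in
DFS today:

  // Hypothetical "localize" / "de-localize" API for DFS.
  import java.io.IOException;
  import org.apache.hadoop.fs.Path;

  public interface BlockLocalization {

      /** Over-replicate every block of 'file' until the calling node holds
          a complete local copy, on top of the normal replication policy. */
      void localize(Path file) throws IOException;

      /** Return to the normal replication policy and delete the spurious
          extra blocks. */
      void delocalize(Path file) throws IOException;
  }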

I was looking at the new filecache code, but it seems geared towards
handling many small files (such as config files, job jars, etc), and it
also seems to simply make full local copies of "cached" files.
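
From a quick look, the usage model is roughly: register a small file with
the job, and every task then gets its own full local copy - something like
the snippet below (class and method names quoted from memory, so treat them
as approximate):

  import java.net.URI;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.filecache.DistributedCache;

  public class FileCacheExample {

      /** At job submission time: register a small file to be cached. */
      public static void configure(JobConf job) throws Exception {
          DistributedCache.addCacheFile(new URI("/user/nutch/conf/stopwords.txt"), job);
      }

      /** Inside a task: each task sees its own full local copy. */
      public static Path[] localCopies(JobConf job) throws Exception {
          return DistributedCache.getLocalCacheFiles(job);
      }
  }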

Any suggestions are welcome ...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
