Hi,

I'm investigating how to implement map-reduce based search in Nutch. Let me describe my current plan, and why I need to "localize" the data blocks.

A mapred search job would basically run map() in a never-ending loop, serving queries. Each node would get its own group of segments, so that the document collection is spread more or less evenly across the mapred nodes. Nutch stores its data in segments, each consisting of a bunch of MapFile data plus the corresponding Lucene indexes - relatively few, very large files.
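
To make this concrete, here's a rough sketch of the never-ending map() idea. All the names here (SearchLoop, SegmentSearcher, etc.) are invented for illustration - this is not existing Nutch code, and the real thing would of course plug into the mapred Mapper interface:

  import java.io.IOException;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;
  import java.util.concurrent.TimeUnit;

  // Hypothetical sketch: the task opens its local slice of the index
  // once, then loops serving queries until told to shut down.
  public class SearchLoop {

    // Stand-in for a Lucene IndexSearcher over one segment's index.
    interface SegmentSearcher {
      String search(String query) throws IOException;
    }

    private final BlockingQueue<String> queries =
        new LinkedBlockingQueue<String>();
    private volatile boolean shutdown = false;

    // This is what map() would boil down to: a loop that never returns
    // (until shutdown), answering queries against local segment data.
    public void run(SegmentSearcher searcher)
        throws IOException, InterruptedException {
      while (!shutdown) {
        String q = queries.poll(1, TimeUnit.SECONDS);
        if (q == null) continue;           // no query yet, keep waiting
        System.out.println(searcher.search(q));
      }
    }

    public void submit(String query) { queries.add(query); }
    public void shutdown() { shutdown = true; }
  }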

Experiments show that using this data directly from DFS is way too slow, so for now it always has to be copied from DFS to local disks first. This is a very expensive step: it uses up valuable (and limited) local disk space, and it has to be performed manually, which is even more expensive and error-prone.
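
For reference, the manual step boils down to something like this (modulo API details, which may differ across versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // The copy we currently do by hand: pull a whole segment out of DFS
  // onto the local disk before searching it. Expensive: a full copy of
  // a few very large files, and double the disk usage.
  public class SegmentLocalizer {
    public static void main(String[] args) throws Exception {
      FileSystem dfs = FileSystem.get(new Configuration());
      Path segment = new Path(args[0]);   // segment dir in DFS
      Path local = new Path(args[1]);     // destination on local disk
      dfs.copyToLocalFile(segment, local);
    }
  }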

I'm curious whether this copying could be avoided when porting this code to run as a mapred job. What I have in mind is a way to tell DFS to locate all blocks belonging to such a file and, if necessary, over-replicate them so that any node requesting this sort of access for a specific file would always find all of that file's blocks locally. This guarantee would hold until a "de-localize" request was made, at which point DFS would go back to its normal replication policy (and delete the spurious block replicas).
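
In API terms, I imagine something roughly like the following - to be clear, none of this exists, it's just to make the proposal concrete:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;

  // Purely hypothetical interface sketching the proposed feature.
  public interface BlockLocalization {

    // Ask DFS to (over-)replicate all blocks of 'file' so that a
    // complete set of replicas exists on 'node'.
    void localize(Path file, String node) throws IOException;

    // Drop the guarantee: DFS returns to its normal replication policy
    // and may delete the now-spurious block replicas.
    void delocalize(Path file, String node) throws IOException;
  }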

I was looking at the new filecache code, but it seems geared towards handling many small files (such as config files, job jars, etc.), and it also seems to simply make full local copies of the "cached" files.
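
If I read the filecache code right, typical usage looks roughly like this (apologies if I'm misreading it):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;

  // As far as I can tell, the filecache works like this: register a
  // small file with the job, and every node gets its own full local
  // copy - which is exactly the kind of copying I'd like to avoid for
  // very large segment files.
  public class FilecacheExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      DistributedCache.addCacheFile(
          new URI("/user/nutch/conf/plugins.jar"), conf);
    }
  }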

Any suggestions are welcome ...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

