Hi,

I'm investigating how to implement map-reduce based search in Nutch. Let me describe my current plan, and why I need to "localize" the data blocks.

A mapred search job would basically run map() in a never-ending loop, serving queries. Each node would get its own group of segments, so that the document collection is spread more or less evenly across the mapred nodes. Nutch stores its data in segments, each consisting of a bunch of MapFile data plus the corresponding Lucene indexes - relatively few, very large files.
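
To make this concrete, here's a rough sketch of the never-ending map() idea. All the names here (SearchLoop, SegmentSearcher, etc.) are invented for illustration - this is not existing Nutch code, and the real thing would of course plug into the mapred Mapper interface:

  import java.io.IOException;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;
  import java.util.concurrent.TimeUnit;

  // Hypothetical sketch: the task opens its local slice of the index
  // once, then loops serving queries until told to shut down.
  public class SearchLoop {

    // Stand-in for a Lucene IndexSearcher over one segment's index.
    interface SegmentSearcher {
      String search(String query) throws IOException;
    }

    private final BlockingQueue<String> queries =
        new LinkedBlockingQueue<String>();
    private volatile boolean shutdown = false;

    // This is what map() would boil down to: a loop that never returns
    // (until shutdown), answering queries against local segment data.
    public void run(SegmentSearcher searcher)
        throws IOException, InterruptedException {
      while (!shutdown) {
        String q = queries.poll(1, TimeUnit.SECONDS);
        if (q == null) continue;           // no query yet, keep waiting
        System.out.println(searcher.search(q));
      }
    }

    public void submit(String query) { queries.add(query); }
    public void shutdown() { shutdown = true; }
  }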

Experiments show that using this data directly from DFS is way too slow, so for now it always has to be copied from DFS to local disks first. This is a very expensive step: it uses up valuable (and limited) local disk space, and it has to be performed manually, which is even more expensive and error-prone.
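
For reference, the manual step boils down to something like this (modulo API details, which may differ across versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // The copy we currently do by hand: pull a whole segment out of DFS
  // onto the local disk before searching it. Expensive: a full copy of
  // a few very large files, and double the disk usage.
  public class SegmentLocalizer {
    public static void main(String[] args) throws Exception {
      FileSystem dfs = FileSystem.get(new Configuration());
      Path segment = new Path(args[0]);   // segment dir in DFS
      Path local = new Path(args[1]);     // destination on local disk
      dfs.copyToLocalFile(segment, local);
    }
  }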

I'm curious whether this copying could be avoided when porting this code to run as a mapred job. What I have in mind is a way to tell DFS to locate all blocks belonging to such a file and, if necessary, over-replicate them so that any node requesting this sort of access for a specific file would always find all of that file's blocks locally. This guarantee would hold until a "de-localize" request was made, at which point DFS would go back to its normal replication policy (and delete the spurious block replicas).
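
In API terms, I imagine something roughly like the following - to be clear, none of this exists, it's just to make the proposal concrete:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;

  // Purely hypothetical interface sketching the proposed feature.
  public interface BlockLocalization {

    // Ask DFS to (over-)replicate all blocks of 'file' so that a
    // complete set of replicas exists on 'node'.
    void localize(Path file, String node) throws IOException;

    // Drop the guarantee: DFS returns to its normal replication policy
    // and may delete the now-spurious block replicas.
    void delocalize(Path file, String node) throws IOException;
  }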

I was looking at the new filecache code, but it seems geared towards handling many small files (such as config files, job jars, etc.), and it also seems to simply make full local copies of the "cached" files.
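
If I read the filecache code right, typical usage looks roughly like this (apologies if I'm misreading it):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;

  // As far as I can tell, the filecache works like this: register a
  // small file with the job, and every node gets its own full local
  // copy - which is exactly the kind of copying I'd like to avoid for
  // very large segment files.
  public class FilecacheExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      DistributedCache.addCacheFile(
          new URI("/user/nutch/conf/plugins.jar"), conf);
    }
  }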

Any suggestions are welcome ...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

