Re: Distributed fetching only happening in one node ?

Andrzej Bialecki Mon, 11 Aug 2008 03:06:32 -0700

brainstorm wrote:

This is one example crawled segment:


/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you see, just one part-NNNN file is generated... in the conf file
(nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
suggested in previous emails).

First of all - for a 7 node cluster the mapred.map.tasks should be set
at least to something around 23 or 31 or even higher, and the number of
reduce tasks to e.g. 11.

Secondly - you should not put this property in nutch-site.xml; instead,
it should be put in mapred-default.xml or hadoop-site.xml. I lost track
of which version of Nutch / Hadoop you are using ... if it's Hadoop
0.12.x, then you need to be careful about where you put
mapred.map.tasks, and it has to be placed in mapred-default.xml. If it's
a more recent Hadoop version then you can put these values in
hadoop-site.xml.
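
For illustration, a minimal sketch of such an entry, assuming a more
recent Hadoop version where the values go in hadoop-site.xml (on 0.12.x
the same properties would go in mapred-default.xml). The numbers are the
example values mentioned above for a 7 node cluster, not tuned
recommendations, and mapred.reduce.tasks is assumed to be the property
meant by "the number of reduce tasks":

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Example value only: number of map tasks for a 7 node cluster -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <!-- Example value only: number of reduce tasks -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>
</configuration>
```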

And finally - what is the distribution of urls in your seed list among
unique hosts? I.e. how many urls come from a single host? Guessing from
the path above - if you are trying to do a DMOZ crawl, then the
distribution should be ok. I did a DMOZ crawl a month ago, using the
then-current trunk/, and all was working well.
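
One way to check the host distribution is to count urls per host in the
seed list. The following is only a sketch, not part of Nutch: the
function name host_distribution is made up for this example, and it
assumes a plain-text seed file with one url per line.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_distribution(urls):
    """Count how many seed urls belong to each host.

    `urls` is any iterable of strings, e.g. an open seed file
    with one url per line (an assumption for this sketch).
    """
    counts = Counter()
    for line in urls:
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comment lines
        try:
            host = urlsplit(url).hostname  # lowercased by urlsplit
        except ValueError:
            continue  # skip malformed urls
        if host:
            counts[host] += 1
    return counts
```

Passing an open seed file to host_distribution() and looking at
counts.most_common(10) shows whether a handful of hosts dominate the
list. If most urls come from one or two hosts, that alone could explain
fetching being concentrated on few nodes.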


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
