brainstorm wrote:

This is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you see, just one part-NNNN file is generated... in the conf file
(nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
suggested in previous emails).

First of all - for a 7-node cluster, mapred.map.tasks should be set to at least 23 or 31, or even higher, and the number of reduce tasks to e.g. 11.

Secondly - you should not put this property in nutch-site.xml; it belongs in mapred-default.xml or hadoop-site.xml. I lost track of which version of Nutch / Hadoop you are using ... if it's Hadoop 0.12.x, then you need to be careful about where you put mapred.map.tasks: it has to be placed in mapred-default.xml. If it's a more recent Hadoop version, then you can put these values in hadoop-site.xml.
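
As a rough sketch of what that could look like in hadoop-site.xml on a more recent Hadoop version (on 0.12.x the same properties would go in mapred-default.xml instead): the values 23 and 11 are just the ones suggested above for a 7-node cluster, and mapred.reduce.tasks is assumed here to be the property meant by "the number of reduce tasks".

```xml
<configuration>
  <!-- Number of map tasks per job; tune for your cluster size. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <!-- Number of reduce tasks per job. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>
</configuration>
```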

And finally - what is the distribution of URLs in your seed list among unique hosts? I.e. how many URLs come from a single host? Guessing from the path above - if you are trying to do a DMOZ crawl, then the distribution should be ok. I did a DMOZ crawl a month ago, using the then-current trunk/, and all was working well.
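
One quick way to check that distribution is to count seed URLs per host. The snippet below is only a minimal sketch, not part of Nutch: host_counts is a hypothetical helper, and it assumes the seed list is plain text with one URL per line, with blank lines and lines starting with "#" ignored.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(lines):
    """Count how many seed URLs fall on each host.

    `lines` is any iterable of strings, one URL per line (assumed format).
    Lines without a parseable host (e.g. missing "http://") are skipped.
    """
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blanks and comment lines
        host = urlsplit(url).hostname  # lowercased host, or None
        if host:
            counts[host] += 1
    return counts
```

For example, host_counts(open("urls.txt")).most_common(10) would list the ten hosts with the most seed URLs, where urls.txt is a placeholder for your own seed file. If a handful of hosts account for most of the list, the distribution is probably skewed.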

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
