brainstorm wrote:
> This is one example crawled segment:
> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
> As you see, just one part-NNNN file is generated... in the conf file
> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
> suggested in previous emails).

First of all - for a 7 node cluster, mapred.map.tasks should be set to
at least something like 23 or 31, or even higher, and the number of
reduce tasks to e.g. 11.
Secondly - you should not put this property in nutch-site.xml; instead
it should go in mapred-default.xml or hadoop-site.xml. I've lost track
of which version of Nutch / Hadoop you are using ... if it's Hadoop
0.12.x, then you need to be careful about where you put
mapred.map.tasks: it has to be placed in mapred-default.xml. If it's
a more recent Hadoop version, then you can put these values in
hadoop-site.xml.
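
As a rough sketch only, the two settings could look like this in
hadoop-site.xml (or in mapred-default.xml on Hadoop 0.12.x). The values 23
and 11 are the example figures mentioned above for a 7 node cluster, not
tuned recommendations, and mapred.reduce.tasks is assumed here as the
property that controls the number of reduce tasks:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Example value from this message; raise it if the cluster can take more. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <!-- Assumed property name for the number of reduce tasks. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>
</configuration>
```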
And finally - what is the distribution of URLs in your seed list among
unique hosts? I.e. how many URLs come from a single host? Guessing from
the path above - if you are trying to do a DMOZ crawl, then the
distribution should be OK. I did a DMOZ crawl a month ago, using the
then-current trunk/, and all was working well.
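
One quick way to check the host distribution is to count URLs per host in
the seed list. The following is a minimal sketch, not part of Nutch; the
function name host_counts and the sample URLs are made up for illustration,
and it assumes a plain-text seed list with one URL per line:

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(lines):
    """Count how many URLs in an iterable of lines belong to each host."""
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments
        host = urlsplit(url).hostname  # lowercased host, or None if unparsable
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample; in practice pass an open seed file instead of a list.
sample = [
    "http://www.example.org/a.html",
    "http://www.example.org/b.html",
    "http://other.example.net/",
]
for host, n in host_counts(sample).most_common():
    print(n, host)
```

If a handful of hosts account for most of the URLs, the seed list is skewed;
a roughly flat distribution is what you would expect from a DMOZ-based list.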
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com