I've recently been experimenting with some distributed crawls using Nutch 0.8
(SVN trunk version) on five machines: one master node (namenode) and 4
slaves. If I use these settings in hadoop-site.xml, all the injected URLs
are fetched:

mapred.map.tasks = 17
mapred.reduce.tasks = 11 or 13 or 17


But if I decrease the number of reducers to something close to the number
of hosts, as is suggested, like:

mapred.map.tasks = 17
mapred.reduce.tasks = 5 or 7

then not all the URLs are fetched: about 20% of the injected URLs are lost,
without any error in the logs!?

Does anyone know what the optimum numbers of map and reduce tasks are, and
why decreasing the number of reducers (which basically decreases the number
of fetchers) causes injected URLs to be lost? Here are some more
benchmarking results:



map=17  red=11
Started at: 18:47:24 PDT 2006
Finished at: 21:45:31 PDT 2006
Threads: 100
Depth: 3
Fetched: 401913

map=17  red=7
Started at: 00:17:38 PDT 2006
Finished at: 02:58:12 PDT 2006
Threads: 100
Depth: 3
Fetched: 362628

map=17  red=5
Started at: 15:33:37 PDT 2006
Finished at: 15:39:50 PDT 2006
Threads: 30
Depth: 1
Fetched: 1682

map=17  red=11
Started at: 15:46:00 PDT 2006
Finished at: 15:52:27 PDT 2006
Threads: 30
Depth: 1
Fetched: 1913

map=17  red=17
Started at: 18:12:26 PDT 2006
Finished at: 18:20:57 PDT 2006
Threads: 100
Depth: 1
Fetched: 1910
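
To make the runs easier to compare, here is a rough Python sketch that
works out the wall-clock time, pages per second, and the shortfall of each
run against the best run at the same depth. The numbers are copied from the
results above; the helper names (elapsed_seconds, summarize) are only
illustrative, and I am assuming each run started and finished on the same
day. Note that the shortfall is relative to the best run, not to the number
of injected URLs (which is not listed here), and that the depth-1 runs used
different thread counts, so they are not strictly comparable.

```python
from datetime import datetime

# (map tasks, reduce tasks, start, finish, threads, depth, fetched),
# copied from the benchmark runs listed above.
RUNS = [
    (17, 11, "18:47:24", "21:45:31", 100, 3, 401913),
    (17, 7, "00:17:38", "02:58:12", 100, 3, 362628),
    (17, 5, "15:33:37", "15:39:50", 30, 1, 1682),
    (17, 11, "15:46:00", "15:52:27", 30, 1, 1913),
    (17, 17, "18:12:26", "18:20:57", 100, 1, 1910),
]


def elapsed_seconds(start, finish):
    """Seconds between two HH:MM:SS times (assumes both are on the same day)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def summarize(runs):
    """Return one dict per run: duration, pages/sec, shortfall vs. best run at that depth."""
    # Best fetched count seen for each crawl depth.
    best = {}
    for _, _, _, _, _, depth, fetched in runs:
        best[depth] = max(best.get(depth, 0), fetched)

    rows = []
    for maps, reds, start, finish, threads, depth, fetched in runs:
        secs = elapsed_seconds(start, finish)
        rows.append({
            "map": maps,
            "red": reds,
            "threads": threads,
            "depth": depth,
            "fetched": fetched,
            "secs": secs,
            "pages_per_sec": fetched / secs,
            "shortfall_pct": 100.0 * (best[depth] - fetched) / best[depth],
        })
    return rows


rows = summarize(RUNS)
for r in rows:
    print("map=%(map)d red=%(red)d threads=%(threads)d depth=%(depth)d "
          "fetched=%(fetched)d secs=%(secs)d rate=%(pages_per_sec).2f/s "
          "shortfall=%(shortfall_pct).1f%%" % r)
```

With these figures, the red=7 run at depth 3 comes out roughly 10% below
the red=11 run, and the red=5 run at depth 1 roughly 12% below the red=11
run at the same depth.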
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general