I've recently been experimenting with some distributed crawls using Nutch 0.8
(SVN trunk version) on five machines: one master node (namenode) and 4
slaves. If I use these settings in hadoop-site.xml, all the injected URLs
are fetched:
mapred.map.tasks = 17
mapred.reduce.tasks = 11 or 13 or 17
But if I decrease the number of reducers, as is suggested, to something close
to the number of hosts, like:
mapred.map.tasks = 17
mapred.reduce.tasks = 5 or 7
then not all the URLs are fetched, and about 20% of the injected URLs are
lost without any error in the logs!
Does anyone know what the optimum number of map and reduce tasks is, and
why decreasing the number of reducers (which basically decreases the number
of fetchers) causes injected URLs to be lost? Here are some more
benchmarking results:
map=17 red=11
Started at: 18:47:24 PDT 2006
Finished at: 21:45:31 PDT 2006
Threads: 100
Depth: 3
Fetched: 401913
map=17 red=7
Started at: 00:17:38 PDT 2006
Finished at: 02:58:12 PDT 2006
Threads: 100
Depth: 3
Fetched: 362628
map=17 red=5
Started at: 15:33:37 PDT 2006
Finished at: 15:39:50 PDT 2006
Threads: 30
Depth: 1
Fetched: 1682
map=17 red=11
Started at: 15:46:00 PDT 2006
Finished at: 15:52:27 PDT 2006
Threads: 30
Depth: 1
Fetched: 1913
map=17 red=17
Started at: 18:12:26 PDT 2006
Finished at: 18:20:57 PDT 2006
Threads: 100
Depth: 1
Fetched: 1910
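
To make the runs above easier to compare, here is a small Python sketch that works out the elapsed time, the fetch rate, and the shortfall of the reduced-reducer runs against the red=11 run with the same thread count and depth. The figures are copied from the results above; the function names and the pairing of runs are my own assumptions, and it assumes each run started and finished on the same day.

```python
from datetime import datetime

# Figures copied from the benchmark runs above:
# (map tasks, reduce tasks, start, finish, threads, depth, fetched)
RUNS = [
    (17, 11, "18:47:24", "21:45:31", 100, 3, 401913),
    (17, 7, "00:17:38", "02:58:12", 100, 3, 362628),
    (17, 5, "15:33:37", "15:39:50", 30, 1, 1682),
    (17, 11, "15:46:00", "15:52:27", 30, 1, 1913),
    (17, 17, "18:12:26", "18:20:57", 100, 1, 1910),
]


def elapsed_seconds(start, finish):
    """Seconds between two HH:MM:SS times, assuming the same calendar day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def shortfall_percent(baseline, fetched):
    """Percentage of the baseline fetch count that is missing from a run."""
    return 100.0 * (baseline - fetched) / baseline


for maps, reduces, start, finish, threads, depth, fetched in RUNS:
    secs = elapsed_seconds(start, finish)
    print(f"map={maps} red={reduces} threads={threads} depth={depth} "
          f"elapsed={secs}s fetched={fetched} rate={fetched / secs:.1f} pages/s")

# Assumed pairing: compare each reduced-reducer run with the red=11 run
# that used the same thread count and depth.
print(f"depth 3, red=7 vs red=11: {shortfall_percent(401913, 362628):.1f}% fewer pages")
print(f"depth 1, red=5 vs red=11: {shortfall_percent(1913, 1682):.1f}% fewer pages")
```

For the runs listed this comes out at roughly 10% fewer pages at depth 3 and roughly 12% fewer at depth 1, assuming the runs are paired correctly. The red=17 run used 100 threads, so it is not directly comparable with the 30-thread runs.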
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general