I set up the nutch-hadoop-dfs environment on a single machine, which acts as both the namenode and the datanode. I then ran the "bin/nutch crawl urls -depth 2 -dir crawl_test" command and took statistics on the crawldb folder using the "bin/nutch readdb crawl_test/crawldb -stats" command. It showed:

    TOTAL urls: 375
    retry 0: 375
    min score: 0.0
    avg score: 0.0070
    max score: 1.019
    status 1 (db_unfetched): 334
    status 2 (db_fetched): 38
    status 5 (db_redir_perm): 3
    CrawlDb statistics: done

I then set up the nutch-hadoop-dfs environment on 5 systems, including the namenode. After performing the same crawl, I took the statistics again, and they are as follows:

    TOTAL urls: 141
    retry 0: 140
    retry 1: 1
    min score: 0.0
    avg score: 0.015
    max score: 1.003
    status 1 (db_unfetched): 131
    status 2 (db_fetched): 8
    status 4 (db_redir_temp): 1
    status 5 (db_redir_perm): 1
    CrawlDb statistics: done

Could someone please explain why the number of URLs fetched differs (38 fetched out of 375 total on one machine, versus 8 fetched out of 141 total on five) when the number of datanodes is increased from 1 to 5?

Thanks in advance.
