I've set up the nutch-hadoop-dfs environment on a single system. It
uses only one machine, which acts as both namenode and datanode. I ran the
"bin/nutch crawl urls -depth 2 -dir crawl_test" command and then took statistics
on the crawldb folder using the "bin/nutch readdb crawl_test/crawldb -stats"
command. It showed:

TOTAL urls:     375
retry 0:        375
min score:      0.0
avg score:      0.0070
max score:      1.019
status 1 (db_unfetched):        334
status 2 (db_fetched):  38
status 5 (db_redir_perm):       3
CrawlDb statistics: done


I then set up the nutch-hadoop-dfs environment with 5 systems,
including the namenode. After performing the same crawl, I took the
statistics again. They are as follows:

TOTAL urls:     141
retry 0:        140
retry 1:        1
min score:      0.0
avg score:      0.015
max score:      1.003
status 1 (db_unfetched):        131
status 2 (db_fetched):  8
status 4 (db_redir_temp):       1
status 5 (db_redir_perm):       1
CrawlDb statistics: done
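
To make the two runs easier to compare side by side, here is a minimal Python sketch that parses both stats outputs quoted above and prints them in one table. The helper names parse_stats and compare are my own and are not part of Nutch. The sketch only assumes the "name: value" layout shown above and does nothing to explain the difference itself.

```python
SINGLE_NODE = """\
TOTAL urls:     375
retry 0:        375
min score:      0.0
avg score:      0.0070
max score:      1.019
status 1 (db_unfetched):        334
status 2 (db_fetched):  38
status 5 (db_redir_perm):       3
CrawlDb statistics: done
"""

FIVE_NODES = """\
TOTAL urls:     141
retry 0:        140
retry 1:        1
min score:      0.0
avg score:      0.015
max score:      1.003
status 1 (db_unfetched):        131
status 2 (db_fetched):  8
status 4 (db_redir_temp):       1
status 5 (db_redir_perm):       1
CrawlDb statistics: done
"""


def parse_stats(text):
    """Parse 'name:   value' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.strip().splitlines():
        # Split on the last colon so names containing parentheses stay intact.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            continue  # skip non-numeric lines such as "CrawlDb statistics: done"
    return stats


def compare(a, b):
    """Return (name, value_in_a, value_in_b) rows; missing entries count as 0."""
    return [(k, a.get(k, 0.0), b.get(k, 0.0)) for k in sorted(set(a) | set(b))]


single = parse_stats(SINGLE_NODE)
five = parse_stats(FIVE_NODES)

for name, one, many in compare(single, five):
    print(f"{name:28s} {one:>10g} {many:>10g}")
```

The first numeric column is the single-node run and the second is the 5-node run.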

Could someone please explain why the number of URLs fetched differs
when the number of datanodes is increased from 1 to 5?
Thanks in advance.
