I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of URLs are actually missing.  I can't find
any trace of them *anywhere* in the logs (whether on the slaves or the
master).  I'm puzzled.  Currently I'm trying to debug the code to see
what's going on.
So far, I noticed the generator is fine, so the issue must lie further
down the pipeline (fetcher?).
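
To compare runs with different numbers of map/reduce tasks, it may help
to pull the status counts out of the "readdb -stats" output with a small
script.  This is only a rough sketch: it assumes the output looks like
the lines quoted further down in this mail, and the names parse_stats,
fetched_fraction and SAMPLE are made up for illustration (they are not
part of Nutch).

```python
import re

# Assumed line formats, taken from the stats output quoted in this thread:
#   "060115 171625 TOTAL urls:       99403"
#   "060115 171625 status 2 (DB_fetched):    1933"
TOTAL_LINE = re.compile(r"TOTAL urls:\s+(\d+)")
STATUS_LINE = re.compile(r"status\s+\d+\s+\((\w+)\):\s+(\d+)")

# Sample input copied from the quoted output (hypothetical variable name).
SAMPLE = """\
060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls:       99403
060115 171625 status 1 (DB_unfetched):  97470
060115 171625 status 2 (DB_fetched):    1933
060115 171625 CrawlDb statistics: done
"""


def parse_stats(text):
    """Return (total, {status_name: count}) parsed from stats output."""
    total = None
    counts = {}
    for line in text.splitlines():
        match = TOTAL_LINE.search(line)
        if match:
            total = int(match.group(1))
            continue
        match = STATUS_LINE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return total, counts


def fetched_fraction(text):
    """Return the share of URLs marked DB_fetched, or None if no total."""
    total, counts = parse_stats(text)
    if not total:
        return None
    return counts.get("DB_fetched", 0) / total


if __name__ == "__main__":
    total, counts = parse_stats(SAMPLE)
    print(total, counts, fetched_fraction(SAMPLE))
```

Running this on the output of each run and comparing the DB_fetched
counts side by side should at least show how much the result moves with
the task count.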

Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

>Hi,
>
>I have set up four boxes using MapReduce, and everything goes smoothly. I
>fed in about 80000 seed nodes to begin with and crawled to depth 2.
>Only 1900 pages (about 300 MB of data) were fetched; the rest are marked
>DB_unfetched.
>Does anyone know what could be wrong?
>
>This is the output of (bin/nutch readdb h2/crawldb -stats):
>
>060115 171625 Statistics for CrawlDb: h2/crawldb
>060115 171625 TOTAL urls:       99403
>060115 171625 avg score:        1.01
>060115 171625 max score:        7.382
>060115 171625 min score:        1.0
>060115 171625 retry 0:  99403
>060115 171625 status 1 (DB_unfetched):  97470
>060115 171625 status 2 (DB_fetched):    1933
>060115 171625 CrawlDb statistics: done
>
>Thanks,
>Mike
>
>  
>
