I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of URLs are actually missing.  I can't find
any trace of them *anywhere* in the logs (whether on the slaves or the
master).  I'm puzzled.  Currently I'm trying to debug the code to see
what's going on.
So far, I've noticed the generator is fine, so the issue must lie further
down the pipeline (fetcher?).
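
To compare DB_fetched between runs with different numbers of map/reduce
tasks, one option is to parse the `readdb -stats` output from each run.
Here is a minimal sketch, assuming the log format is the one in the output
quoted below; the function names are my own for illustration and are not
part of Nutch.

```python
import re

# Assumed line shapes, taken from the quoted readdb -stats output:
#   "060115 171625 TOTAL urls:       99403"
#   "060115 171625 status 2 (DB_fetched):    1933"
TOTAL_RE = re.compile(r"TOTAL urls:\s+(\d+)")
STATUS_RE = re.compile(r"status\s+\d+\s+\((\w+)\):\s+(\d+)")


def parse_stats(text):
    """Return (total, {status_name: count}) parsed from stats output."""
    total = None
    counts = {}
    for line in text.splitlines():
        m = TOTAL_RE.search(line)
        if m:
            total = int(m.group(1))
            continue
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return total, counts


def fetched_fraction(text):
    """Fraction of URLs marked DB_fetched, or None if no total was found."""
    total, counts = parse_stats(text)
    if not total:
        return None
    return counts.get("DB_fetched", 0) / total
```

Running `fetched_fraction` on the saved stats of each run gives one number
per configuration, which makes the differences easier to track.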

Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

>Hi,
>
>I have set up four boxes using MapReduce, and everything goes smoothly. I
>fed in about 80000 seed nodes to begin with and crawled to depth 2.
>Only 1900 pages (about 300 MB of data) were fetched, and the rest are
>marked as DB_unfetched. Does anyone know what could be wrong?
>
>This is the output of (bin/nutch readdb h2/crawldb -stats):
>
>060115 171625 Statistics for CrawlDb: h2/crawldb
>060115 171625 TOTAL urls:       99403
>060115 171625 avg score:        1.01
>060115 171625 max score:        7.382
>060115 171625 min score:        1.0
>060115 171625 retry 0:  99403
>060115 171625 status 1 (DB_unfetched):  97470
>060115 171625 status 2 (DB_fetched):    1933
>060115 171625 CrawlDb statistics: done
>
>Thanks,
>Mike
>
>  
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
