I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of URLs are actually missing: I can't find any trace of them *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently I'm trying to debug the code to see what's going on. So far, I noticed the generator is fine, so the issue must lie further down the pipeline (fetcher?).
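To compare runs with different numbers of map/reduce tasks, it may help to tally the status counts from the `readdb -stats` output automatically. Below is a rough sketch in Python, assuming the output keeps the line format shown in the quoted message below. The function names are made up for illustration and are not part of Nutch.

```python
import re

# Matches lines such as "060115 171625 status 2 (DB_fetched): 1933".
STATUS_RE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s+(\d+)")
# Matches lines such as "060115 171625 TOTAL urls: 99403".
TOTAL_RE = re.compile(r"TOTAL urls:\s+(\d+)")

# Sample taken from the stats output quoted in this thread.
SAMPLE = """\
060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls: 99403
060115 171625 retry 0: 99403
060115 171625 status 1 (DB_unfetched): 97470
060115 171625 status 2 (DB_fetched): 1933
060115 171625 CrawlDb statistics: done
"""


def parse_crawldb_stats(text):
    """Return (total_urls, {status_name: count}) parsed from -stats output."""
    total = None
    counts = {}
    for line in text.splitlines():
        match = TOTAL_RE.search(line)
        if match:
            total = int(match.group(1))
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(3))
    return total, counts


def fetched_fraction(text):
    """Return DB_fetched / TOTAL urls, or None if the total is missing or zero."""
    total, counts = parse_crawldb_stats(text)
    if not total:
        return None
    return counts.get("DB_fetched", 0) / total


if __name__ == "__main__":
    total, counts = parse_crawldb_stats(SAMPLE)
    print(total, counts)
    print(f"fetched fraction: {fetched_fraction(SAMPLE):.2%}")
```

Running this on the stats saved from each run would show how the DB_fetched share moves as the task count changes.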
Let me know if you find anything regarding this issue.
Thanks.

--Flo

Mike Smith wrote:

>Hi,
>
>I have setup for boxes using MapReduce, everything goes smoothly, I have
>feeded about 80000 seed nodes for begining and I have crawled by depth 2.
>Only 1900 pages (about 300MG) data and the rest is marked and db unfetched.
>Does any one know what could be wrong?
>
>This is the output of (bin/nutch readdb h2/crawldb -stats):
>
>060115 171625 Statistics for CrawlDb: h2/crawldb
>060115 171625 TOTAL urls: 99403
>060115 171625 avg score: 1.01
>060115 171625 max score: 7.382
>060115 171625 min score: 1.0
>060115 171625 retry 0: 99403
>060115 171625 status 1 (DB_unfetched): 97470
>060115 171625 status 2 (DB_fetched): 1933
>060115 171625 CrawlDb statistics: done
>
>Thanks,
>Mike
