I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched counts. Looking at the logs, a lot of URLs are actually missing: I can't find any trace of them *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently I'm trying to debug the code to see what's going on. So far, I noticed the generator is fine, so the issue must lie further down the pipeline (fetcher?).
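To compare runs made with different task counts, it may help to tally the status lines from each `readdb -stats` run rather than eyeballing them. Below is a minimal sketch that parses output in the shape quoted further down in this message. The regular expressions and the helper names (`parse_stats`, `fetched_fraction`) are my own assumptions based only on that quoted output, not anything provided by Nutch.

```python
import re

# Matches lines such as "060115 171625 status 2 (DB_fetched): 1933".
# Assumption: the format is exactly what the quoted output shows.
STATUS_RE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s+(\d+)")
TOTAL_RE = re.compile(r"TOTAL urls:\s+(\d+)")


def parse_stats(text):
    """Return (total, {status_name: count}) parsed from readdb -stats output."""
    total_match = TOTAL_RE.search(text)
    total = int(total_match.group(1)) if total_match else None
    counts = {name: int(n) for _code, name, n in STATUS_RE.findall(text)}
    return total, counts


def fetched_fraction(text):
    """Return DB_fetched / TOTAL, or None if no total line was found."""
    total, counts = parse_stats(text)
    if not total:
        return None
    return counts.get("DB_fetched", 0) / total


# Sample taken from the output quoted in this thread.
sample = """\
060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls: 99403
060115 171625 status 1 (DB_unfetched): 97470
060115 171625 status 2 (DB_fetched): 1933
"""

if __name__ == "__main__":
    total, counts = parse_stats(sample)
    print("total:", total)
    print("counts:", counts)
    # Sanity check: the per-status counts should add up to the total.
    print("counts sum to total:", sum(counts.values()) == total)
    print("fetched fraction:", fetched_fraction(sample))
```

Saving the stats output of each run to a file and feeding it through `parse_stats` would make it easy to see whether DB_fetched really varies with the number of map/reduce tasks while the total stays the same.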
Let me know if you find anything regarding this issue.

Thanks.
--Flo

Mike Smith wrote:

>Hi,
>
>I have set up four boxes using MapReduce and everything goes smoothly. I fed
>in about 80000 seed nodes to begin with and crawled to depth 2.
>Only 1900 pages (about 300 MB of data) were fetched; the rest are marked as
>DB_unfetched. Does anyone know what could be wrong?
>
>This is the output of (bin/nutch readdb h2/crawldb -stats):
>
>060115 171625 Statistics for CrawlDb: h2/crawldb
>060115 171625 TOTAL urls: 99403
>060115 171625 avg score: 1.01
>060115 171625 max score: 7.382
>060115 171625 min score: 1.0
>060115 171625 retry 0: 99403
>060115 171625 status 1 (DB_unfetched): 97470
>060115 171625 status 2 (DB_fetched): 1933
>060115 171625 CrawlDb statistics: done
>
>Thanks,
>Mike

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
