I've experienced the same effect. When I decrease the number of map/reduce tasks, I can fetch more web pages, but increasing them increases the number of unfetched pages. I also get some "java.net.SocketTimeoutException: Read timed out" exceptions in my datanode log files, but those timeouts couldn't account for this many missing pages. I agree the problem should be somewhere in the fetcher.

Mike

On 1/17/06, Florent Gluck <[EMAIL PROTECTED]> wrote:
>
> I'm having the exact same problem.
> I noticed that changing the number of map/reduce tasks gives me
> different DB_fetched results.
> Looking at the logs, a lot of URLs are actually missing. I can't find
> any trace of them *anywhere* in the logs (whether on the slaves or the
> master). I'm puzzled. Currently I'm trying to debug the code to see
> what's going on.
> So far, I noticed the generator is fine, so the issue must lie further
> down the pipeline (fetcher?).
>
> Let me know if you find anything regarding this issue. Thanks.
>
> --Flo
>
> Mike Smith wrote:
>
> >Hi,
> >
> >I have set up four boxes using MapReduce, and everything goes smoothly. I
> >fed in about 80000 seed URLs to begin with and crawled to depth 2.
> >Only about 1900 pages (about 300 MB of data) were fetched, and the rest
> >are marked as DB_unfetched.
> >Does anyone know what could be wrong?
> >
> >This is the output of (bin/nutch readdb h2/crawldb -stats):
> >
> >060115 171625 Statistics for CrawlDb: h2/crawldb
> >060115 171625 TOTAL urls: 99403
> >060115 171625 avg score: 1.01
> >060115 171625 max score: 7.382
> >060115 171625 min score: 1.0
> >060115 171625 retry 0: 99403
> >060115 171625 status 1 (DB_unfetched): 97470
> >060115 171625 status 2 (DB_fetched): 1933
> >060115 171625 CrawlDb statistics: done
> >
> >Thanks,
> >Mike
