I've seen the same effect. When I decrease the number of map/reduce
tasks, I can fetch more web pages, but increasing it increases the number
of unfetched pages. I also get some "java.net.SocketTimeoutException: Read
timed out" exceptions in my datanode log files, but those timeouts couldn't
account for this many missing pages. I agree the problem is probably
somewhere in the fetcher.
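
One way to check that claim is to count the timeout lines in the logs and
compare the total against the unfetched count that readdb reports. The
following is only a rough sketch: the function names are made up for
illustration, and it assumes each timeout shows up as one log line containing
the exception text quoted above.

```python
import re

# Exception text as it appears in the datanode logs mentioned above.
TIMEOUT_MARKER = "java.net.SocketTimeoutException: Read timed out"


def count_timeouts(log_lines):
    """Count log lines that contain the read-timeout exception text."""
    return sum(1 for line in log_lines if TIMEOUT_MARKER in line)


def parse_crawldb_stats(stats_text):
    """Extract the DB_unfetched / DB_fetched counts from `readdb -stats` output."""
    counts = {}
    for name in ("DB_unfetched", "DB_fetched"):
        # Matches e.g. "status 1 (DB_unfetched):  97470"
        match = re.search(r"\(" + name + r"\):\s*(\d+)", stats_text)
        if match:
            counts[name] = int(match.group(1))
    return counts


def timeouts_explain_gap(timeout_count, counts):
    """Return True only if there are at least as many timeouts as unfetched URLs."""
    return timeout_count >= counts.get("DB_unfetched", 0)
```

To use it, pass the lines of each datanode log file (wherever they live on
your slaves) to count_timeouts, sum the results, and feed the saved readdb
output to parse_crawldb_stats. If the timeout total is far below the
DB_unfetched count, the timeouts alone can't be the cause.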

Mike


On 1/17/06, Florent Gluck <[EMAIL PROTECTED]> wrote:
>
> I'm having the exact same problem.
> I noticed that changing the number of map/reduce tasks gives me
> different DB_fetched results.
> Looking at the logs, a lot of URLs are actually missing.  I can't find
> any trace of them *anywhere* in the logs (whether on the slaves or the
> master).  I'm puzzled.  Currently I'm trying to debug the code to see
> what's going on.
> So far, I noticed the generator is fine, so the issue must lie further
> down the pipeline (fetcher?).
>
> Let me know if you find anything regarding this issue. Thanks.
>
> --Flo
>
> Mike Smith wrote:
>
> >Hi,
> >
> >I have set up four boxes using MapReduce, and everything runs smoothly. I
> >fed in about 80000 seed URLs to begin with and crawled to depth 2.
> >Only about 1900 pages (about 300 MB of data) were fetched, and the rest
> >are marked as DB_unfetched.
> >Does anyone know what could be wrong?
> >
> >This is the output of (bin/nutch readdb h2/crawldb -stats):
> >
> >060115 171625 Statistics for CrawlDb: h2/crawldb
> >060115 171625 TOTAL urls:       99403
> >060115 171625 avg score:        1.01
> >060115 171625 max score:        7.382
> >060115 171625 min score:        1.0
> >060115 171625 retry 0:  99403
> >060115 171625 status 1 (DB_unfetched):  97470
> >060115 171625 status 2 (DB_fetched):    1933
> >060115 171625 CrawlDb statistics: done
> >
> >Thanks,
> >Mike
> >
> >
> >
>
>
