Hi Florent, I did some more testing. Here are the results:
I have 3 machines (P4, 1 GB RAM each). All three are datanodes and one
of them is also the namenode. I started from 80000 seed URLs and ran a
depth-1 crawl under different configurations. The number of unfetched
pages changes with the configuration (the nutch-site.xml properties I'm
varying are sketched at the end of this message, after the quoted
thread):

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of threads per host: 2
http.timeout: 10 sec
-------------------------------
6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
-------------------------------
18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of threads per host: 20
http.timeout: 10 sec
-------------------------------
37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of threads per host: 20
http.timeout: 10 sec
-------------------------------
34000 pages fetched

--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
-------------------------------
52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of threads per host: 100
http.timeout: 20 sec
-------------------------------
57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of threads per host: 20
http.timeout: 20 sec
-------------------------------
60000 pages fetched

Do you have any idea why pages go missing in the fetcher without any
log entry or exception? It really seems to depend on the number of
reduce tasks! (The checks I plan to run to trace where the URLs go are
also sketched at the bottom of this message.)

Thanks,
Mike

On 1/17/06, Mike Smith <[EMAIL PROTECTED]> wrote:
>
> I've experienced the same effect. When I decrease the number of
> map/reduce tasks I can fetch more web pages, but increasing them
> increases the number of unfetched pages. I also get some
> "java.net.SocketTimeoutException: Read timed out" exceptions in my
> datanode log files, but those timeouts couldn't account for this many
> missing pages! I agree the problem should be somewhere in the fetcher.
>
> Mike
>
>
> On 1/17/06, Florent Gluck <[EMAIL PROTECTED]> wrote:
> >
> > I'm having the exact same problem.
> > I noticed that changing the number of map/reduce tasks gives me
> > different DB_fetched results.
> > Looking at the logs, a lot of urls are actually missing. I can't find
> > their trace *anywhere* in the logs (whether on the slaves or the
> > master). I'm puzzled. Currently I'm trying to debug the code to see
> > what's going on.
> > So far, I noticed the generator is fine, so the issue must lie
> > further down the pipeline (fetcher?).
> >
> > Let me know if you find anything regarding this issue. Thanks.
> >
> > --Flo
> >
> > Mike Smith wrote:
> >
> > >Hi,
> > >
> > >I have set up four boxes using MapReduce and everything goes
> > >smoothly. I have fed in about 80000 seed URLs to begin with and
> > >crawled to depth 2. Only 1900 pages (about 300 MB of data) were
> > >fetched and the rest is marked as db_unfetched.
> > >Does anyone know what could be wrong?
> > >
> > >This is the output of (bin/nutch readdb h2/crawldb -stats):
> > >
> > >060115 171625 Statistics for CrawlDb: h2/crawldb
> > >060115 171625 TOTAL urls: 99403
> > >060115 171625 avg score: 1.01
> > >060115 171625 max score: 7.382
> > >060115 171625 min score: 1.0
> > >060115 171625 retry 0: 99403
> > >060115 171625 status 1 (DB_unfetched): 97470
> > >060115 171625 status 2 (DB_fetched): 1933
> > >060115 171625 CrawlDb statistics: done
> > >
> > >Thanks,
> > >Mike
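P.S. For reference, here is roughly how the settings above map onto
config properties, using configuration 7 as the example. I'm assuming
the standard property names from nutch-default.xml in the mapred branch
(and that http.timeout is given in milliseconds there), so double-check
against your own checkout. My overrides in conf/nutch-site.xml look
roughly like this:

  <property>
    <name>mapred.map.tasks</name>
    <value>50</value>        <!-- "Number of map tasks" -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>120</value>       <!-- "Number of reduce tasks" -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>250</value>       <!-- "Number of fetch threads" -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>20</value>        <!-- "Number of threads per host" -->
  </property>
  <property>
    <name>http.timeout</name>
    <value>20000</value>     <!-- 20 sec, expressed in milliseconds -->
  </property>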

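To try to narrow down where the URLs disappear, this is roughly the
check I plan to run next. It assumes the readdb tool in this build
supports -dump and that the fetcher logs one "fetching <url>" line per
URL it picks up, so adjust if your version behaves differently (the
paths are from my setup):

  # dump the crawldb to plain text
  bin/nutch readdb h2/crawldb -dump h2/crawldb_dump

  # after copying h2/crawldb_dump out of NDFS to the local disk,
  # count URLs by status
  cat crawldb_dump/part-* | grep -c "DB_fetched"
  cat crawldb_dump/part-* | grep -c "DB_unfetched"

  # on each slave, count how many URLs the fetcher actually attempted;
  # point this at wherever your tasktracker logs live
  grep -h "fetching http" logs/*.log | sort -u | wc -l

Comparing that last count against the size of the generated fetchlist
should at least show whether the URLs are lost before the fetcher ever
sees them or later, in the reduce/updatedb step.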