For what it's worth, I fetch my segments of 1 million
URLs with 80 threads at a time and see no slowdowns.


I'll grab some of my stats and publish them, but I
haven't had problems with the fetcher slowing down like
this in a long time.

(Linux/CentOS 4.2 platform)

-byron

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Ken Krugler wrote:
> 
> >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> >> 300k URL list.
> >>
> >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> >> fetcher gets progressively slower though, dropping down to about 15
> >> pages/s after 2-3 hours or so, and it continues to slow down. I've
> >> seen a few references on these lists to the issue, but I'm not clear
> >> on whether it's expected behaviour or whether there's a solution to
> >> it. I've also noticed that the process takes up more and more memory
> >> as it runs. Is this expected as well?
> >
> >
> > We've run into a similar situation, though we're using Nutch 0.7. What
> > seems to be happening is that a host is slowly trickling data back to
> > us. This happens when we're trying to release the connection, and we
> > get stuck in the commons-httpclient code at
> > ChunkedInputStream.exhaustInputStream().
> >
> > I have a theory that this happens when our HTTP protocol max size
> > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > our case, 1MB) and then tries to release the connection, but for some
> > reason the host keeps sending us data, albeit at a very slow rate.
> > I was seeing 30 Kbit/s or so.
> >
> > Anyway, I've added the commons-httpclient code to my project and am
> > plugging in some additional logging to help track down the issue.
> 
> 
> I would appreciate any feedback. Please also note that you need to
> eliminate other factors, like the limit of threads per host, but most
> notably the overhead of parsing. Please use the -noParse flag to
> fetcher for all those experiments. In the past it was common for the
> fetcher to be stuck in a buggy parser plugin, so you will need to
> eliminate this factor.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
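
To make the size-limit theory quoted above concrete, here is a minimal
sketch of reading a response body only up to a byte limit. It uses the
JDK alone; BoundedRead, Result, and readUpTo are hypothetical names made
up for illustration and are not part of Nutch or commons-httpclient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedRead {

    /** The bytes that were kept, plus whether reading stopped at the limit. */
    public static final class Result {
        public final byte[] content;
        public final boolean limitReached;

        Result(byte[] content, boolean limitReached) {
            this.content = content;
            this.limitReached = limitReached;
        }
    }

    /**
     * Reads at most maxBytes from the stream and then stops.
     * It deliberately does NOT drain whatever the peer is still sending.
     * Note: limitReached is also true when the body is exactly maxBytes long.
     */
    public static Result readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (out.size() < maxBytes) {
            int want = Math.min(buf.length, maxBytes - out.size());
            int n = in.read(buf, 0, want);
            if (n < 0) {
                // End of stream before the limit: the whole body was read.
                return new Result(out.toByteArray(), false);
            }
            out.write(buf, 0, n);
        }
        return new Result(out.toByteArray(), true);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large response body; a real fetch would use a socket stream.
        byte[] body = new byte[5000];
        Result r = readUpTo(new ByteArrayInputStream(body), 1024);
        System.out.println("kept " + r.content.length
                + " bytes, limitReached=" + r.limitReached);
    }
}
```

When limitReached is true, the caller still has to decide what to do
with the connection. Releasing it for reuse means the rest of a chunked
body must be consumed first, which is where a slow host can hold a
fetcher thread for a long time. Aborting the connection instead avoids
that wait. I believe commons-httpclient 3.x offers HttpMethod.abort()
for this, but treat that as an assumption to check against the version
you are running.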
