There is an issue with the PDFBox library shipped with Nutch 0.7: it can
hang while parsing certain PDF files. PDFBox 0.7.2 fixes this issue. If you
are parsing PDF files, this could also be contributing to the freezing.
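
If upgrading is not an option right away, one generic way to keep a hung
parser from stalling everything is to run each parse on a worker thread
with a time limit. The following is only a minimal sketch in plain Java;
the class name ParseWithTimeout, the method runWithTimeout, and the
"SKIPPED" fallback are hypothetical names made up for illustration and are
not part of Nutch or PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not Nutch or PDFBox code): bounds how long a parse may run.
class ParseWithTimeout {

    /**
     * Runs parseTask on a daemon worker thread and waits at most timeoutMillis.
     * Returns the fallback value if the task times out, fails, or is interrupted.
     */
    static <T> T runWithTimeout(Callable<T> parseTask, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-worker");
            t.setDaemon(true); // a stuck parser thread must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; a parser spinning in a tight loop
            // may ignore the interrupt and keep running in the background.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parser itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A well-behaved "parser" finishes in time.
        System.out.println(runWithTimeout(() -> "parsed text", 1000, "SKIPPED"));
        // A simulated hang is abandoned after 200 ms.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never returned";
        }, 200, "SKIPPED"));
    }
}
```

Note that this only stops the caller from waiting; a parser stuck in a busy
loop will keep burning CPU until the process exits, so fixing the library is
still the real cure.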

Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797

-----Original Message-----
From: Byron Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, October 28, 2005 8:10 AM
To: [email protected]
Subject: Re: fetch questions - freezing

For what it's worth, I fetch my segments of 1 million
URLs with 80 threads at a time and see no slowdowns.


I'll grab some of my stats and publish them, but I
haven't had problems with the fetcher slowing down like
this in a long time.

(Linux/CentOS 4.2 platform)

-byron

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Ken Krugler wrote:
>
> >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> >> 300k URL list.
> >>
> >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> >> fetcher gets progressively slower though, dropping down to about ~15
> >> pages/s after about 2-3 hours or so, and continues to slow down. I've
> >> seen a few references on these lists to the issue, but I'm not clear
> >> on whether it's expected behaviour or if there's a solution to it. I've
> >> also noticed that the process takes up more and more memory as it
> >> runs; is this expected as well?
> >
> >
> > We've run into a similar situation, though we're using Nutch 0.7. What
> > seems to be happening is that a host is slowly trickling data back to
> > us. This happens when we're trying to release the connection, and we
> > get stuck in the commons-httpclient code at
> > ChunkedInputStream.exhaustInputStream().
> >
> > I have a theory that this happens when our http protocol max size
> > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > our case, 1MB) and then tries to release the connection, but for some
> > reason the host keeps sending us data, albeit at some very slow rate.
> > I was seeing 30Kbits/second or so.
> >
> > Anyway, I've added the commons-httpclient code to my project and am
> > plugging in some additional logging to help track down the issue.
>
>
> I would appreciate any feedback. Please also note that you need to
> eliminate other factors, like the limit of threads per host, but most
> notably the overhead of parsing - please use the -noParse flag to
> fetcher for all those experiments. In the past it was common for the
> fetcher to be stuck in a buggy parser plugin, so you will need to
> eliminate this factor.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
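
Ken's theory in the quoted message (the fetcher reads up to the size limit,
then blocks while the rest of a slowly trickling response is drained) can be
illustrated with a capped read that stops at the limit and reports that it
did so, leaving it to the caller to drop the connection instead of reading
it to the end. This is a rough sketch in plain Java, not the
protocol-httpclient or commons-httpclient code; BoundedFetch, Result, and
readUpTo are hypothetical names, and whether a given HTTP client lets you
abort rather than drain is an assumption to check against its docs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration (not Nutch code): read a response body up to a size cap.
class BoundedFetch {

    /** The bytes that were kept, plus whether the size cap was reached. */
    static final class Result {
        final byte[] content;
        final boolean hitLimit;

        Result(byte[] content, boolean hitLimit) {
            this.content = content;
            this.hitLimit = hitLimit;
        }
    }

    /**
     * Reads at most maxBytes from the stream and then stops.
     * It deliberately does NOT drain whatever remains: on a host that
     * trickles data, draining is the step that can block for a long time.
     * hitLimit means "the cap was reached", so a body of exactly maxBytes
     * is also reported as having hit the limit.
     */
    static Result readUpTo(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (out.size() < maxBytes) {
                int want = Math.min(buf.length, maxBytes - out.size());
                int n = in.read(buf, 0, want);
                if (n < 0) {
                    // The body ended before the cap was reached.
                    return new Result(out.toByteArray(), false);
                }
                out.write(buf, 0, n);
            }
            return new Result(out.toByteArray(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = new byte[10_000]; // stands in for a response larger than the cap
        Result r = readUpTo(new ByteArrayInputStream(body), 4096);
        System.out.println(r.content.length + " bytes kept, hitLimit=" + r.hitLimit);
    }
}
```

When hitLimit is true, the caller would close or abort the underlying
connection rather than hand it back for reuse, since reuse normally requires
consuming the remainder of the response first.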

