For what it's worth, I fetch my segments of 1 million URLs with 80 threads at a time and see no slowdowns.
I'll grab some of my stats and publish them, but I haven't had problems with the fetcher slowing down like this in a long time (Linux/CentOS 4.2 platform).

-byron

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Ken Krugler wrote:
>
> >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> >> 300k url list.
> >>
> >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> >> fetcher gets progressively slower though, dropping down to ~15
> >> pages/s after 2-3 hours or so, and it continues to slow down. I've
> >> seen a few references on these lists to the issue, but I'm not clear
> >> on whether it's expected behaviour or if there's a solution to it.
> >> I've also noticed that the process takes up more and more memory as
> >> it runs; is this expected as well?
> >
> > We've run into a similar situation, though we're using Nutch 0.7. What
> > seems to be happening is that a host is slowly trickling data back to
> > us. This happens when we're trying to release the connection, and we
> > get stuck in the commons-httpclient code at
> > ChunkedInputStream.exhaustInputStream().
> >
> > I have a theory that this happens when our http protocol max size
> > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > our case, 1MB) and then tries to release the connection, but for some
> > reason the host keeps sending us data, albeit at some very slow rate.
> > I was seeing 30Kbits/second or so.
> >
> > Anyway, I've added the commons-httpclient code to my project and am
> > plugging in some additional logging to help track down the issue.
>
> I would appreciate any feedback. Please also note that you need to
> eliminate other factors, like the limit of threads per host, but most
> notably the overhead of parsing - please use the -noParse flag to
> fetcher for all those experiments. In the past it was common for the
> fetcher to be stuck in a buggy parser plugin, so you will need to
> eliminate this factor.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
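
A quick back-of-the-envelope check on Ken's theory, for anyone following along: if a fetcher thread has to drain the rest of a response at the roughly 30 Kbits/second he reported, how long is that thread tied up? The sketch below is plain arithmetic in Python, not Nutch or commons-httpclient code. The `drain_seconds` helper and the leftover sizes are made up for illustration; the thread does not say how much data was actually left on those connections.

```python
def drain_seconds(remaining_bytes: int, rate_bits_per_second: float) -> float:
    """Time needed to read the rest of a response body at a given trickle rate."""
    if rate_bits_per_second <= 0:
        raise ValueError("rate must be positive")
    return remaining_bytes * 8 / rate_bits_per_second


# Rate reported in the thread: roughly 30 Kbits/second.
TRICKLE_RATE = 30_000  # bits per second

# Hypothetical leftover sizes (assumption, not measured values).
for leftover_mb in (1, 10, 100):
    seconds = drain_seconds(leftover_mb * 1_000_000, TRICKLE_RATE)
    print(f"{leftover_mb:>3} MB left -> {seconds / 60:.1f} minutes blocked")
```

At that rate each leftover megabyte costs a thread a bit under four and a half minutes, so a handful of large, slow responses could plausibly pin a noticeable share of the threads over a multi-hour fetch. That is only consistent with the theory, not proof of it; the -noParse and threads-per-host checks Andrzej suggests still need to be ruled out first.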
