Trunk? Map reduce? Could you describe your box setup, job division, and maybe post your conf/nutch-site.xml file?
Just trying to get things going and not having much luck with the mapreduce branch. I also tried trunk; the crawl stops at around 30000 pages (out of maybe a million), and once it's done I can't get results to show up via Tomcat.

Thanks,
Earl

--- Byron Miller <[EMAIL PROTECTED]> wrote:

> For what it's worth, I fetch my segments of 1 million
> URLs with 80 threads at a time and no slowdowns.
>
> I'll grab some of my stats and publish them, but I
> haven't had problems with the fetcher slowing down like
> this in a long time.
>
> (linux/Centos 4.2 platform)
>
> -byron
>
> --- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> > Ken Krugler wrote:
> >
> > >> I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
> > >> 300k url list.
> > >>
> > >> Initially, it's able to reach ~25 pages/s with 150 threads. The
> > >> fetcher gets progressively slower though, dropping down to about ~15
> > >> pages/s after about 2-3 hours or so, and continues to slow down. I've
> > >> seen a few references on these lists to the issue, but I'm not clear
> > >> on whether it's expected behaviour or if there's a solution to it. I've
> > >> also noticed that the process takes up more and more memory as it
> > >> runs; is this expected as well?
> > >
> > > We've run into a similar situation, though we're using Nutch 0.7. What
> > > seems to be happening is that a host is slowly trickling data back to
> > > us. This happens when we're trying to release the connection, and we
> > > get stuck in the commons-httpclient code at
> > > ChunkedInputStream.exhaustInputStream().
> > >
> > > I have a theory that this happens when our http protocol max size
> > > limit is hit. The protocol-httpclient plugin reads up to the limit (in
> > > our case, 1MB) and then tries to release the connection, but for some
> > > reason the host keeps sending us data, albeit at some very slow rate.
> > > I was seeing 30Kbits/second or so.
> > >
> > > Anyway, I've added the commons-httpclient code to my project and am
> > > plugging in some additional logging to help track down the issue.
> >
> > I would appreciate any feedback. Please also note that you need to
> > eliminate other factors, like the limit of threads per host, but most
> > notably the overhead of parsing - please use the -noParse flag to
> > fetcher for all those experiments. In the past it was common for the
> > fetcher to be stuck in a buggy parser plugin, so you will need to
> > eliminate this factor.
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
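
For anyone following along, here is a rough illustration of the behaviour Ken's theory describes: read a response body only up to a size cap, then stop and close the stream rather than reading whatever the host still has to send. This is a minimal Python sketch, not the protocol-httpclient or commons-httpclient code. `read_capped`, `MAX_CONTENT_BYTES`, and the in-memory stream standing in for a network connection are all hypothetical names chosen for the example.

```python
import io

# Assumed cap of 1 MB, matching the limit Ken mentions; the name is hypothetical.
MAX_CONTENT_BYTES = 1024 * 1024


def read_capped(stream, limit=MAX_CONTENT_BYTES, chunk_size=8192):
    """Read at most `limit` bytes from `stream`, then stop.

    Returns (data, hit_limit). When hit_limit is True the cap was reached
    and the rest of the stream was deliberately left unread; the caller is
    expected to close the connection instead of draining it.
    """
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            # The body ended before the cap was reached.
            return b"".join(chunks), False
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks), True


# Usage: a 5000-byte body read with a 1000-byte cap.
body = io.BytesIO(b"x" * 5000)
data, hit_limit = read_capped(body, limit=1000, chunk_size=256)
body.close()  # drop the unread remainder instead of reading it out
```

If the stall really is in draining the leftover body, the point of interest is the close step: a client that insists on reading the remainder before releasing the connection is at the mercy of however slowly the host sends it.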