Ken Krugler wrote:
I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
300k URL list.
Initially, it's able to reach ~25 pages/s with 150 threads. The
fetcher gets progressively slower, though, dropping to about ~15
pages/s after 2-3 hours, and it continues to slow down from there. I've
seen a few references to this issue on these lists, but I'm not clear
on whether it's expected behaviour or whether there's a solution. I've
also noticed that the process takes up more and more memory as it
runs; is this expected as well?
We've run into a similar situation, though we're using Nutch 0.7. What
seems to be happening is that a host slowly trickles data back to
us. This happens when we're trying to release the connection, and we
get stuck in the commons-httpclient code at
ChunkedInputStream.exhaustInputStream().
I have a theory that this happens when our HTTP content size limit is
hit. The protocol-httpclient plugin reads up to the limit (in our
case, 1 MB) and then tries to release the connection, but for some
reason the host keeps sending us data, albeit at a very slow rate.
I was seeing 30 Kbit/s or so.
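For context, releaseConnection() tries to drain the rest of the response so the connection can be reused, which is why a slow-trickling host can park a fetcher thread inside exhaustInputStream(); commons-httpclient 3.x added HttpMethod.abort(), which closes the socket instead. The following is a minimal, library-free sketch of the capped-read idea; readCapped and the limits used are my own illustration, not Nutch code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CappedRead {

    /**
     * Read at most maxBytes from a response stream and return what was
     * read. If the limit is hit while data is still pending, the caller
     * should close/abort the underlying connection rather than call a
     * release method that drains the remainder; draining is where a
     * slow-trickling host can stall the fetcher thread.
     */
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(buf, total, maxBytes - total);
            if (n < 0) break;          // server finished first: safe to release
            total += n;
        }
        byte[] out = new byte[total];
        System.arraycopy(buf, 0, out, 0, total);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Pretend a 2 KB response with a 1 KB cap (stands in for the 1 MB limit).
        InputStream in = new ByteArrayInputStream(new byte[2048]);
        byte[] got = readCapped(in, 1024);
        System.out.println(got.length);          // prints 1024
        System.out.println(in.available() > 0);  // prints true: data pending, so abort
    }
}
```

The key decision point is the leftover data: if the stream ended on its own, releasing (and reusing) the connection is fine, but if the cap cut the read short, aborting avoids blocking on whatever the host still wants to send.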
Anyway, I've added the commons-httpclient code to my project and am
plugging in some additional logging to help track down the issue.
I would appreciate any feedback. Please also note that you need to
eliminate other factors, like the limit of threads per host, and most
notably the overhead of parsing: please use the -noParse flag to the
fetcher for all these experiments. In the past it was common for the
fetcher to get stuck in a buggy parser plugin, so you will need to
eliminate this factor.
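For reference, a fetch-only run might look like the sketch below; the segment path and thread count are illustrative, and the exact flag spelling can differ between the mapred branch and 0.7, so check the usage output of bin/nutch fetch first:

```shell
# Fetch without parsing, to rule the parser plugins out
# (segment path is hypothetical; flag name as used in this thread)
bin/nutch fetch crawl/segments/20051020123456 -threads 150 -noParse
```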
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com