Re: fetch questions - freezing

Ken Krugler Fri, 28 Oct 2005 09:26:35 -0700

I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 300k URL list.
Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to about ~15 pages/s after 2-3 hours or so, and it continues to slow down. I've seen a few references on these lists to the issue, but I'm not clear on whether it's expected behaviour or if there's a solution to it. I've also noticed that the process takes up more and more memory as it runs; is this expected as well?
We've run into a similar situation, though we're using Nutch 0.7. What seems to be happening is that a host is slowly trickling data back to us. This happens when we're trying to release the connection, and we get stuck in the commons-httpclient code at ChunkedInputStream.exhaustInputStream().
I have a theory that this happens when our http protocol max size limit is hit. The protocol-httpclient plugin reads up to the limit (in our case, 1MB) and then tries to release the connection, but for some reason the host keeps sending us data, albeit at some very slow rate. I was seeing 30 Kbits/second or so.
Anyway, I've added the commons-httpclient code to my project and am plugging in some additional logging to help track down the issue.
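
To illustrate the theory above, here is a minimal simulation of the suspected behaviour. It is only a sketch in Python, not the commons-httpclient code: trickling_body, fetch, and the drain_on_release flag are hypothetical names, and the "host" is just a generator that yields chunks with a delay. The point is that a release which drains the rest of a chunked response has to wait for every remaining chunk, while simply stopping at the size limit does not.

```python
import time
from typing import Iterator, Tuple


def trickling_body(total_chunks: int, chunk_size: int, delay_s: float) -> Iterator[bytes]:
    """Hypothetical slow host: waits delay_s before yielding each chunk."""
    for _ in range(total_chunks):
        time.sleep(delay_s)
        yield b"x" * chunk_size


def fetch(body: Iterator[bytes], max_bytes: int, drain_on_release: bool) -> Tuple[bytes, int]:
    """Read at most max_bytes from body, then 'release' the connection.

    Returns the kept content and the number of chunks pulled from the host.
    """
    data = bytearray()
    chunks_read = 0
    for chunk in body:
        chunks_read += 1
        # Keep only as much of this chunk as still fits under the limit.
        data.extend(chunk[: max_bytes - len(data)])
        if len(data) >= max_bytes:
            break
    if drain_on_release:
        # Assumed behaviour: the release consumes whatever the host still sends.
        for _ in body:
            chunks_read += 1
    return bytes(data), chunks_read


if __name__ == "__main__":
    for drain in (False, True):
        start = time.time()
        content, chunks = fetch(trickling_body(20, 100, 0.01), 250, drain)
        elapsed = time.time() - start
        print(f"drain={drain}: kept {len(content)} bytes, "
              f"pulled {chunks} chunks in {elapsed:.2f}s")
```

With draining enabled, the time spent in fetch scales with the full response size rather than with the size limit, which would match a thread sitting in the release path for as long as the host keeps trickling data.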
I would appreciate any feedback. Please also note that you need to eliminate other factors, like the limit of threads per host, but most notably the overhead of parsing. Please use the -noParse flag to the fetcher for all those experiments. In the past it was common for the fetcher to be stuck in a buggy parser plugin, so you will need to eliminate this factor.

We're only using the html & text parsers, so I don't think that's the problem. Plus we're dumping the thread stack when it hangs, and it's always in the ChunkedInputStream.exhaustInputStream() process (see trace below).

We've left the # of threads per host set to 1, and varied the total number of threads from 50 up to 400. Increasing from 50 to 200 definitely improved performance, but going from 200 to 400 seemed to have minimal impact, other than boosting the CPU usage to 80%.


More research results to come...

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
