The answer is that http.content.limit is indeed broken in the protocol-httpclient plugin, though it doesn't look like it's entirely Nutch's fault.

The org.apache.nutch.protocol.httpclient.HttpResponse class does the right thing in trying to abort the GET at the content limit, but when it calls close() on the request's input stream at HttpResponse.java:120, the org.apache.commons.httpclient.AutoCloseInputStream class goes off and reads the entire response anyway. I found that if get.abort() is called when the content goes over the limit, the request is terminated and Nutch is able to do the right thing. The default protocol-http plugin does not use the Apache Commons HttpClient code at all, and it works correctly.
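For illustration, here is a minimal sketch of the kind of read loop I mean. It is not the exact Nutch 0.9 source; the method and variable names (readLimited, contentLimit, get) are my own, assumed for the example.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Read the response body up to contentLimit bytes, then abort the method.
    private static byte[] readLimited(GetMethod get, int contentLimit)
        throws IOException {
      InputStream in = get.getResponseBodyAsStream();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      int total = 0;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
        total += n;
        if (contentLimit >= 0 && total >= contentLimit) {
          // Aborting drops the connection immediately; calling in.close()
          // instead would let AutoCloseInputStream read the rest of the
          // response before closing, which is what causes the hang.
          get.abort();
          break;
        }
      }
      return out.toByteArray();
    }

The key difference from what HttpResponse currently does is calling get.abort() rather than closing the stream once the limit is hit.

On 5/10/07, charlie w <[EMAIL PROTECTED]> wrote: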
I'm using Nutch 0.9. It appears that Nutch is ignoring the http.content.limit number in the config file. I have left this setting at the default (64K), and the httpclient plugin logs that value (...httpclient.Http - http.content.limit= 65536), yet Nutch is attempting to fetch a 115MB file. I have verified that the server is indeed sending the Content-Length header.

As you might imagine, this takes a very long time, and the Fetcher gives up on that fetcher thread: 2007-05-10 08:29:31,963 WARN fetcher.Fetcher - Aborting with 1 hung threads. Then, the next time around the generate/fetch/update loop, Nutch tries to fetch that same document again. I'm doing a very deep crawl, and this combination of behaviors wound up giving me 10 threads all hung fetching the same file. Bandwidth issues followed...

I've groveled through the code, and while I see HttpBase reading the http.content.limit value from the config, I can't find anywhere that the value is actually used. Now, this is an flv file, and sure, I could easily filter the URL out based on its extension, but I'd like to find a more generic solution.

So is it true that http.content.limit is not implemented yet, or is there something I don't understand about the way it is meant to work?

Regards
Charlie
