charlie w wrote:
The answer is that http.content.limit is indeed broken in the
protocol-httpclient plugin, though it doesn't really look like it's
entirely Nutch's fault.
The org.apache.nutch.protocol.httpclient.HttpResponse class is doing the
right thing in trying to abort the GET at the content limit, but when it
calls close on the input stream of the request at HttpResponse.java:120,
the org.apache.commons.httpclient.AutoCloseInputStream class goes off and
tries to read the entire response anyway.
I found that if get.abort() is called when the content goes over limit, the
request is terminated and Nutch is able to do the right thing.
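
A minimal sketch of that idea, using only the JDK so it can be run on its own. The names ContentLimitSketch, Abortable and readUpToLimit are hypothetical and are not taken from the Nutch source; Abortable merely stands in for the GetMethod object, whose abort() is the call described above. The sketch assumes that reaching the limit without seeing end-of-stream means the body is over the limit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ContentLimitSketch {

    /** Hypothetical stand-in for the request object (GetMethod in the plugin). */
    interface Abortable {
        void abort();
    }

    /**
     * Reads at most `limit` bytes from `in`.
     * If end-of-stream arrives first, the stream is closed normally.
     * If the limit is reached first, the request is aborted instead of
     * closing the stream, so that nothing tries to drain the remainder.
     */
    static byte[] readUpToLimit(InputStream in, int limit, Abortable request)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = limit;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                in.close();            // whole body fit under the limit
                return out.toByteArray();
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        request.abort();               // limit reached: abort, do not close()
        return out.toByteArray();
    }
}
```

In the plugin itself the last step would presumably be get.abort() on the GetMethod, in place of the close() call on the response stream.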
The default protocol-http plugin does not use the Apache Commons HttpClient
library, and works correctly.
Could you please create a JIRA issue, so that your analysis and the
possible fix are recorded? Thanks!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com