Hi Andrzej,
I will definitely try it out. I made some attempts at using HttpClient with Nutch earlier, and while testing it on bigger crawls I found one problem: the http-client library logs some errors at the severe level that are not fatal from Nutch's point of view (I think it was a cyclic redirect or something similar, but I would have to check to be sure). It happens very rarely, but it did happen during my tests. Because of the way Fetcher is currently implemented, this caused the whole fetch process to stop.
I ran these tests some months ago, so this may have changed, but it may be worth investigating.
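A minimal sketch of one possible mitigation, not something tested against Nutch: downgrade severe records from a chosen logger to warnings before any handler sees them. It assumes the HttpClient messages end up in java.util.logging, and the logger name used in main is an assumption for illustration only. Note that java.util.logging filters are not inherited by child loggers, so the filter has to be installed on the exact logger that emits the message. Whether this would keep Fetcher running depends on how it detects severe messages, which would need checking.

```java
import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class SevereDowngrade {

    /** Rewrites SEVERE records to WARNING; every record is still passed on. */
    static final class CapAtWarning implements Filter {
        @Override
        public boolean isLoggable(LogRecord record) {
            if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
                record.setLevel(Level.WARNING);
            }
            return true; // keep the record, only its level changes
        }
    }

    /** Installs the filter on exactly the named logger (not on its children). */
    static Logger install(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setFilter(new CapAtWarning());
        return logger;
    }

    public static void main(String[] args) {
        // Assumed logger name, for illustration only.
        Logger log = install("org.apache.commons.httpclient.HttpMethodBase");
        log.severe("cyclic redirect (simulated)"); // reaches handlers as WARNING
    }
}
```

The returned Logger should be kept in a field by the caller, since java.util.logging holds loggers only weakly and a collected logger would lose its filter.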
Regards
Piotr
Andrzej Bialecki wrote:
Hi,
This is just a quick update on the subject. As a part of my work on Fetcher improvements I brought up-to-date a plugin (contributed by Andy Hedges and Ken Meltsner), which uses Jakarta Commons HttpClient for the HTTP handling. It internally handles all the hairy issues such as redirects, timeouts, cookies, authentication, etc. I also added support for HTTPS and content-based redirection (meta http-equiv=refresh).
So far in my experience it works perfectly well - at least as well as the existing protocol-http, plus of course all the additional features...
I tested it on several sites, but I'm also looking for some brave souls ;-) to test it in their environment, preferably with heavy load, to see whether it behaves properly.
You can download the patch from here:
http://www.getopt.org/nutch/20050507.patch
This is a large patch; it changes quite a few interfaces and also the way Fetcher works (see http://issues.apache.org/jira/browse/NUTCH-54 for details). You will also notice some debugging output; this will be removed in the final version.
You will also need the new plugins (protocol-http and parse-js, the latter just a skeleton for now) from here:
http://www.getopt.org/nutch/new-plugins.zip
Unpack this into the src/plugin directory.
Looking forward to your comments.
