Hi,
This is just a quick update on the subject. As part of my work on Fetcher improvements I brought a plugin up to date (contributed by Andy Hedges and Ken Meltsner) that uses Jakarta Commons HttpClient for HTTP handling. It internally handles all the hairy issues such as redirects, timeouts, cookies, authentication, etc. I also added support for HTTPS and content-based redirection (meta http-equiv=refresh).
So far, in my experience, it works at least as well as the existing protocol-http, and of course it adds all the features listed above.
I tested it on several sites, but I'm also looking for some brave souls ;-) to test it in their environment, preferably with heavy load, to see whether it behaves properly.
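For anyone unfamiliar with the content-based redirection mentioned above, here is a minimal sketch of the idea: look for a meta http-equiv=refresh tag in the fetched page and pull the target URL out of its content attribute. This is not the code from the patch. The class and method names (MetaRefreshSketch, extractTarget) are made up for illustration, and it uses plain regular expressions rather than a real HTML parser, so treat it as a rough approximation only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not code from the patch.
final class MetaRefreshSketch {
    // Any <meta ...> tag, case-insensitive.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // http-equiv=refresh, with or without quotes around the value.
    private static final Pattern HTTP_EQUIV_REFRESH =
        Pattern.compile("http-equiv\\s*=\\s*[\"']?refresh\\b", Pattern.CASE_INSENSITIVE);
    // A quoted content attribute; group 1 (double quotes) or group 2 (single quotes).
    private static final Pattern CONTENT_ATTR =
        Pattern.compile("content\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);
    // "<delay>; url=<target>" inside the content value.
    private static final Pattern DELAY_AND_URL =
        Pattern.compile("\\s*\\d+\\s*;\\s*url\\s*=\\s*(.+?)\\s*", Pattern.CASE_INSENSITIVE);

    private MetaRefreshSketch() {}

    /** Returns the redirect target of the first meta refresh tag that names a URL. */
    static Optional<String> extractTarget(String html) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String metaTag = tag.group();
            if (!HTTP_EQUIV_REFRESH.matcher(metaTag).find()) {
                continue; // some other meta tag
            }
            Matcher content = CONTENT_ATTR.matcher(metaTag);
            if (!content.find()) {
                continue; // refresh tag without a quoted content attribute
            }
            String value = content.group(1) != null ? content.group(1) : content.group(2);
            Matcher target = DELAY_AND_URL.matcher(value);
            if (target.matches()) {
                return Optional.of(stripQuotes(target.group(1)));
            }
            // A bare delay such as content="30" only reloads the page; keep looking.
        }
        return Optional.empty();
    }

    // Removes one pair of matching quotes around the URL, if present.
    private static String stripQuotes(String s) {
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return s.substring(1, s.length() - 1);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<meta http-equiv=\"refresh\" content=\"0; url=http://example.com/next\">"
            + "</head></html>";
        System.out.println(extractTarget(page).orElse("(no meta refresh)"));
    }
}
```

A fetcher using something like this would still need to resolve a relative target against the URL of the page it came from before following it.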
You can download the patch from here:
http://www.getopt.org/nutch/20050507.patch
This is a large patch: it changes quite a few interfaces and also the way Fetcher works (see http://issues.apache.org/jira/browse/NUTCH-54 for details). You will also notice some debugging output; this will be removed in the final version.
You will also need the new plugins (protocol-http and parse-js, the latter just a skeleton for now) from here:
http://www.getopt.org/nutch/new-plugins.zip
Unpack this into the src/plugin directory.
Looking forward to your comments.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
