[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ]
Andrzej Bialecki updated NUTCH-54:
-----------------------------------
Attachment: final.diff
This is the final version of the Fetcher improvements, for review. The most
significant change is that ProtocolStatus and ParseStatus are now persistent,
which should give other components much better status reporting on the
fetching and parsing processes.
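To illustrate what "persistent" means here, the following is a minimal sketch
of a status record that can be written to storage and read back. It is not the
code from the patch: the class name StatusSketch, the numeric codes and the
byte layout are all assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a persistent status record; NOT the Nutch class.
class StatusSketch {
    // Illustrative codes only; the real status codes are defined in the patch.
    static final int SUCCESS = 1;
    static final int FAILED = 2;

    final int code;
    final String message;

    StatusSketch(int code, String message) {
        this.code = code;
        this.message = message;
    }

    // Serialize the status so it can be stored alongside the fetched content.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(code);
            out.writeUTF(message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read back a status previously written by toBytes().
    static StatusSketch fromBytes(byte[] bytes) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            int code = in.readInt();
            String message = in.readUTF();
            return new StatusSketch(code, message);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StatusSketch stored = new StatusSketch(FAILED, "HTTP 404");
        StatusSketch loaded = fromBytes(stored.toBytes());
        System.out.println(loaded.code + " " + loaded.message);
    }
}
```

With a record like this stored per page, a later component can inspect why a
fetch or parse failed without repeating the work.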
The HttpClient-based plugin has been improved to handle other HTTP codes
better. It now also handles HTTPS, including sites with invalid or
self-signed certificates.
HtmlParser has been extended to provide an alternative implementation based on
TagSoup. In my experiments it works at least as well as Neko, and in some
cases much better, especially for very broken HTML (e.g. containing multiple
<html> tags). Which parser implementation is used is selected in the config
file.
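Config-driven selection between the two parsers could look roughly like the
sketch below. It does not reproduce the patch: the property key
"parser.html.impl", the values "neko" and "tagsoup", the default, and the
DomBuilder interface are assumptions for illustration, and the two builder
classes are empty stand-ins rather than wrappers around the real libraries.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch of config-driven parser selection; NOT Nutch code.
class ParserSelectionSketch {
    // Minimal interface standing in for an HTML parsing backend.
    interface DomBuilder {
        String name();
    }

    // Placeholder for a Neko-backed implementation.
    static final class NekoBuilder implements DomBuilder {
        public String name() { return "neko"; }
    }

    // Placeholder for a TagSoup-backed implementation.
    static final class TagSoupBuilder implements DomBuilder {
        public String name() { return "tagsoup"; }
    }

    // Assumed property key; the source only says "the config file".
    static final String IMPL_KEY = "parser.html.impl";

    // Pick the backend named in the configuration, defaulting to Neko
    // (the default is also an assumption).
    static DomBuilder select(Properties conf) {
        String impl = conf.getProperty(IMPL_KEY, "neko")
                          .trim()
                          .toLowerCase(Locale.ROOT);
        switch (impl) {
            case "tagsoup":
                return new TagSoupBuilder();
            case "neko":
                return new NekoBuilder();
            default:
                throw new IllegalArgumentException(
                    "Unknown parser implementation: " + impl);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(IMPL_KEY, "tagsoup");
        System.out.println(select(conf).name());
    }
}
```

Rejecting unknown values instead of silently falling back makes a typo in the
config file visible at startup.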
> Fetcher improvements
> ---------------------
>
> Key: NUTCH-54
> URL: http://issues.apache.org/jira/browse/NUTCH-54
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Attachments: 20050518.patch, ProtocolOutput.java, final-plugins.zip,
> final.diff, new-plugins.zip, parsestatus.patch, status.patch
>
> Fetcher improvements.