Hi,

I believe I have reached a stage where the code is ready to be committed. Please review the final version I posted at http://issues.apache.org/jira/browse/NUTCH-54 .

The most significant changes in this version of the patchset:

* ProtocolStatus and ParseStatus are now persistent, which should give other components much better status reporting on the fetching and parsing processes. ProtocolStatus is persisted as part of FetcherOutput, and ParseStatus is persisted in ParseData. All consumers of the previous (int) status codes have been converted to use this new status information. NOTE: this changes the on-disk format of segment data. The code will read the previous formats, but it will write only the new format.

* The HttpClient-based plugin has been improved to handle other HTTP response codes better. It now also handles HTTPS, including sites with invalid or self-signed certificates.

* HtmlParser has been extended with an alternative implementation based on TagSoup. In my experiments it works at least as well as Neko, and in some cases it worked much better, especially for very broken HTML (e.g. containing multiple <html> tags). The parser implementation to use is selected in the config file.

* JavaScript parser and filter: this plugin can now handle both standalone JS files and code snippets embedded in <script> tags and HTML event attributes.
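
To illustrate the first item above (persisting a structured status instead of a bare int code), here is a minimal sketch. It is NOT the code from the patch: the class name StatusSketch, its two codes, and its field layout are hypothetical and chosen only for illustration. It assumes a status can be represented as a numeric code plus a message, written and read back through the standard java.io.DataOutput / DataInput interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical status record; not the actual ProtocolStatus or ParseStatus. */
public class StatusSketch {
    // Illustrative codes only; the real classes define their own.
    public static final int SUCCESS = 1;
    public static final int FAILED = 2;

    private final int code;
    private final String message;

    public StatusSketch(int code, String message) {
        this.code = code;
        this.message = message == null ? "" : message;
    }

    public int getCode() { return code; }
    public String getMessage() { return message; }
    public boolean isSuccess() { return code == SUCCESS; }

    /** Writes the code followed by the message, so both survive on disk. */
    public void write(DataOutput out) throws IOException {
        out.writeInt(code);
        out.writeUTF(message);
    }

    /** Reads a status back in the same field order used by write(). */
    public static StatusSketch read(DataInput in) throws IOException {
        int code = in.readInt();
        String message = in.readUTF();
        return new StatusSketch(code, message);
    }

    /** Serializes to a byte array and reads it back, as a stand-in for disk I/O. */
    public static StatusSketch roundTrip(StatusSketch status) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                status.write(out);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StatusSketch restored = roundTrip(new StatusSketch(FAILED, "HTTP 404"));
        System.out.println(restored.getCode() + " " + restored.getMessage());
    }
}
```

The point of the sketch is only that a consumer reading the stored record gets the reason for a failure along with the code, which a plain int cannot carry.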

Please review this code. If there are no objections, I will start committing it in 24 hours.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
