Hi,
I believe I have reached the stage where the code is ready to be committed.
Please review the final version I posted on
http://issues.apache.org/jira/browse/NUTCH-54 .
The most significant changes in this version of the patchset:
* ProtocolStatus and ParseStatus are persistent, which should give
other components much better status reporting on the fetching and
parsing processes. ProtocolStatus is persisted as part of
FetcherOutput, and ParseStatus is persisted in ParseData. All consumers
of the previous (int) status codes have been converted to use this new
status information.
NOTE: this changes the on-disk formats for segment data. The code will
read previous formats, but it will produce only the new format.
* The HttpClient-based plugin has been improved to handle other HTTP
status codes better. It now also handles HTTPS, including sites with
invalid or self-signed certificates.
* HtmlParser has been extended to provide an alternative implementation
based on TagSoup. In my experiments it works at least as well as Neko,
and in some cases it worked much better, especially for very broken
HTML (e.g. containing multiple <html> tags). The parser implementation
to use is selected in the config file.
* JavaScript parser and filter: this plugin can now handle both
standalone JS files and code snippets embedded in <script> elements and
HTML event attributes.
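
To illustrate the NOTE above about on-disk formats (old formats are
read, only the new format is written), here is a minimal sketch of the
general technique: a version byte in front of each record. This is not
the code from the patch. The class name StatusRecordSketch, the version
numbers, and the byte layout are all hypothetical and chosen only for
illustration; the real classes may serialize quite differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record, NOT the actual ProtocolStatus/ParseStatus classes.
class StatusRecordSketch {
    // Assumed version markers; the real on-disk values may differ.
    static final byte OLD_VERSION = 1; // old layout: version byte + int code
    static final byte CUR_VERSION = 2; // new layout: adds a length-prefixed message

    final int code;
    final String message;

    StatusRecordSketch(int code, String message) {
        this.code = code;
        this.message = message;
    }

    // Writing always produces the current format.
    byte[] toBytes() {
        byte[] msg = message.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 4 + msg.length);
        buf.put(CUR_VERSION).putInt(code).putInt(msg.length).put(msg);
        return buf.array();
    }

    // Reading accepts both layouts and dispatches on the version byte.
    static StatusRecordSketch fromBytes(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte version = buf.get();
        if (version == OLD_VERSION) {
            // Old records carry only the int code; use an empty message.
            return new StatusRecordSketch(buf.getInt(), "");
        }
        if (version == CUR_VERSION) {
            int code = buf.getInt();
            byte[] msg = new byte[buf.getInt()];
            buf.get(msg);
            return new StatusRecordSketch(code, new String(msg, StandardCharsets.UTF_8));
        }
        throw new IllegalArgumentException("Unknown record version: " + version);
    }

    public static void main(String[] args) {
        // Round trip through the current format.
        StatusRecordSketch s = fromBytes(new StatusRecordSketch(3, "gone").toBytes());
        System.out.println(s.code + " " + s.message);

        // A record in the assumed old layout is still readable.
        byte[] old = ByteBuffer.allocate(5).put(OLD_VERSION).putInt(7).array();
        System.out.println(fromBytes(old).code);
    }
}
```

With a scheme like this, readers never need to know in advance which
format a segment was written in, and old data migrates to the new
format whenever it is rewritten.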
Please review this code. If there are no objections, I will start
committing it in 24 hours.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com