Final review: Fetcher improvements, ready to commit

Andrzej Bialecki Tue, 31 May 2005 14:10:07 -0700

Hi,

I believe I have reached a stage where the code is ready to be committed. Please review the final version I posted on http://issues.apache.org/jira/browse/NUTCH-54 .


The most significant changes in this version of the patchset:

* ProtocolStatus and ParseStatus are persistent, which should give other components much better status reporting on the fetching and parsing processes. ProtocolStatus is persisted as a part of FetcherOutput, and ParseStatus is persisted in ParseData. All consumers of the previous (int) status codes have been converted to use this new status information. NOTE: this changes the on-disk formats for segment data. The code will read previous formats, but it will produce only the new format.

* HttpClient-based plugin has been improved to handle other HTTP codes in a better way. It now also handles HTTPS, including sites with invalid or self-signed certificates.

* HtmlParser has been extended to provide an alternative implementation based on TagSoup. In my experiments it works at least as well as Neko, and there were also cases where it worked much better, especially for very broken HTML (e.g. containing multiple <html> tags). The parser implementation to use is selected in the config file.

* JavaScript parser and filter: this plugin is now able to handle both standalone JS files and code snippets embedded in <script> tags and HTML event handlers.
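
For reviewers who want a picture of the "reads previous formats, writes only the new one" behaviour mentioned in the first item, here is a minimal sketch of the general technique: a version byte in front of each record, with the reader branching on it. This is not the code from the patch. The class name StatusRecordSketch, the version constants, and the record layout (an int code plus a message string) are all assumptions made up for illustration; the real ProtocolStatus and ParseStatus layouts are in the patch attached to NUTCH-54.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical record type; NOT the actual Nutch ProtocolStatus/ParseStatus.
class StatusRecordSketch {
    // Assumed version tags, chosen only for this example.
    static final byte VERSION_OLD = 1; // status stored as a bare int
    static final byte VERSION_NEW = 2; // status stored as int code + message

    final int code;
    final String message;

    StatusRecordSketch(int code, String message) {
        this.code = code;
        this.message = message;
    }

    // The writer always produces the newest format.
    byte[] write() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(VERSION_NEW);
        out.writeInt(code);
        out.writeUTF(message);
        out.flush();
        return bytes.toByteArray();
    }

    // The reader accepts both formats and upgrades old records in memory.
    static StatusRecordSketch read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte version = in.readByte();
        switch (version) {
            case VERSION_OLD:
                // Old records carry only the int code; no message is available.
                return new StatusRecordSketch(in.readInt(), "");
            case VERSION_NEW: {
                int code = in.readInt();
                String message = in.readUTF();
                return new StatusRecordSketch(code, message);
            }
            default:
                throw new IOException("Unknown record version: " + version);
        }
    }

    // Helper that fabricates an old-format record, for demonstration only.
    static byte[] writeOld(int code) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(VERSION_OLD);
        out.writeInt(code);
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StatusRecordSketch legacy = read(writeOld(3));
        StatusRecordSketch current = read(new StatusRecordSketch(1, "ok").write());
        System.out.println(legacy.code + " [" + legacy.message + "]");
        System.out.println(current.code + " [" + current.message + "]");
    }
}
```

The point of the sketch is only that old data stays readable while everything written from now on is in the new format, so existing segments can still be read but older code cannot read newly written segments.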

Please review this code. If there are no objections, I will start committing this after the next 24 hours.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
