[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ]

Andrzej Bialecki  updated NUTCH-54:
-----------------------------------

    Attachment: final.diff

This is the final version of the Fetcher improvements, for review. The most 
significant change is that ProtocolStatus and ParseStatus are now persistent, 
which should give other components much better status reporting on the 
fetching and parsing processes.
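
As a rough illustration of what "persistent" means for a status object, here 
is a minimal sketch that writes a status code and message to a byte stream and 
reads them back. The class StatusSketch, its fields, and the code values are 
hypothetical and are not taken from final.diff; the actual ProtocolStatus and 
ParseStatus classes may store different fields in a different layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a persistent status record (not the patch's class).
class StatusSketch {
    static final int SUCCESS = 1; // assumed code value, for illustration only
    static final int FAILED = 2;  // assumed code value, for illustration only

    final int code;
    final String message;

    StatusSketch(int code, String message) {
        this.code = code;
        this.message = message;
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(code);
        out.writeUTF(message);
    }

    // Read the fields back in the same order they were written.
    static StatusSketch read(DataInput in) throws IOException {
        int code = in.readInt();
        String message = in.readUTF();
        return new StatusSketch(code, message);
    }

    // Convenience helper: serialize to a byte array.
    static byte[] toBytes(StatusSketch status) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            status.write(out);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: deserialize from a byte array.
    static StatusSketch fromBytes(byte[] bytes) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A component that later reads the stored record could then branch on the code 
and report the message instead of guessing why a fetch or parse failed.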

The HttpClient-based plugin has been improved to handle other HTTP codes 
better. It now also handles HTTPS, including sites with invalid or 
self-signed certificates.
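
For readers unfamiliar with how a client can accept invalid or self-signed 
certificates, the following sketch shows one common approach using only the 
standard JSSE classes: an SSLContext whose trust manager accepts every 
certificate chain. This is an assumption about the general technique, not the 
code used by the plugin, and the class name PermissiveTlsSketch is 
hypothetical. Disabling certificate checks removes protection against 
man-in-the-middle attacks, which may be acceptable for a crawler but not for 
most other clients.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch: an SSLContext that performs no certificate validation.
class PermissiveTlsSketch {
    // Trust manager that accepts any client or server certificate chain.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: no validation is performed.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: self-signed and expired certificates pass.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    // Build a TLS context that uses the permissive trust manager.
    static SSLContext permissiveContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { TRUST_ALL }, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create TLS context", e);
        }
    }
}
```

The socket factory from permissiveContext().getSocketFactory() could then be 
handed to whatever HTTP client performs the fetch.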

HtmlParser has been extended to provide an alternative implementation based on 
TagSoup. In my experiments it works at least as well as Neko, and in some 
cases it worked much better, especially for badly broken HTML 
(e.g. containing multiple <html> tags). The parser implementation to use is 
selected in the config file.
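
A minimal sketch of config-driven parser selection is given here, assuming a 
single string property that names the implementation. The property key 
"parser.html.impl", the class ParserSelectionSketch, and the fallback to Neko 
are assumptions made for illustration; the key and defaults actually used are 
defined by the patch and its config file.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: choosing an HTML parser backend from configuration.
class ParserSelectionSketch {
    // Assumed property name; check the patch's config file for the real key.
    static final String IMPL_KEY = "parser.html.impl";

    enum Backend { NEKO, TAGSOUP }

    // Pick a backend from a Properties object.
    static Backend select(Properties conf) {
        String impl = conf.getProperty(IMPL_KEY, "neko")
                .trim()
                .toLowerCase(Locale.ROOT);
        if (impl.equals("tagsoup")) {
            return Backend.TAGSOUP;
        }
        // Assumption: any other value, or a missing key, falls back to Neko.
        return Backend.NEKO;
    }

    // Convenience helper for a raw property value; null means "not set".
    static Backend selectFromValue(String value) {
        Properties conf = new Properties();
        if (value != null) {
            conf.setProperty(IMPL_KEY, value);
        }
        return select(conf);
    }
}
```

Keeping the choice in configuration means the two implementations can be 
compared on the same crawl data without recompiling.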

> Fetcher  improvements
> ---------------------
>
>          Key: NUTCH-54
>          URL: http://issues.apache.org/jira/browse/NUTCH-54
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 
>  Attachments: 20050518.patch, ProtocolOutput.java, final-plugins.zip, 
> final.diff, new-plugins.zip, parsestatus.patch, status.patch
>
> Fetcher improvements.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
