[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ]
Andrzej Bialecki updated NUTCH-54:
-----------------------------------

    Attachment: final.diff

This is the final version of the Fetcher improvements, for review.

The most significant change is that ProtocolStatus and ParseStatus are
now persistent, which should give other components much better status
reporting on the fetching and parsing processes.

The HttpClient-based plugin has been improved to handle other HTTP
status codes better. It now also handles HTTPS, including sites with
invalid or self-signed certificates.

HtmlParser has been extended to provide an alternative implementation
based on TagSoup. In my experiments it works at least as well as Neko,
and in some cases it worked much better, especially on very broken HTML
(e.g. documents containing multiple <html> tags). Which parser
implementation is used is selected in the config file.

> Fetcher improvements
> ---------------------
>
>          Key: NUTCH-54
>          URL: http://issues.apache.org/jira/browse/NUTCH-54
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>  Attachments: 20050518.patch, ProtocolOutput.java, final-plugins.zip,
>               final.diff, new-plugins.zip, parsestatus.patch, status.patch
>
> Fetcher improvements.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
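
The update above says the HtmlParser implementation (Neko or TagSoup) is selected from the config file. As a rough illustration of that idea only, here is a minimal sketch of config-driven backend selection. It does not reproduce the code in final.diff: the class names, the `HtmlBackend` interface, the property key `parser.html.impl`, and the choice of `neko` as the fallback are all assumptions made for this example, so check the configuration shipped with the patch for the actual key and values.

```java
import java.util.Locale;
import java.util.Properties;

public class ParserSelectionSketch {

    /** Hypothetical stand-in for an HTML parser backend; not a Nutch interface. */
    public interface HtmlBackend {
        String name();
    }

    /** Placeholder for a Neko-based backend (no real parsing here). */
    static final class NekoBackend implements HtmlBackend {
        public String name() { return "neko"; }
    }

    /** Placeholder for a TagSoup-based backend (no real parsing here). */
    static final class TagSoupBackend implements HtmlBackend {
        public String name() { return "tagsoup"; }
    }

    /** Assumed property key; the key used by the actual patch may differ. */
    public static final String IMPL_KEY = "parser.html.impl";

    /**
     * Picks a backend from the configuration. Falls back to "neko" when the
     * key is absent (an assumption for this sketch) and rejects unknown values
     * so that a typo in the config fails loudly instead of silently.
     */
    public static HtmlBackend select(Properties conf) {
        String impl = conf.getProperty(IMPL_KEY, "neko")
                          .trim()
                          .toLowerCase(Locale.ROOT);
        switch (impl) {
            case "tagsoup":
                return new TagSoupBackend();
            case "neko":
                return new NekoBackend();
            default:
                throw new IllegalArgumentException(
                    "Unknown value for " + IMPL_KEY + ": " + impl);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(IMPL_KEY, "tagsoup");
        System.out.println("Selected backend: " + select(conf).name());
    }
}
```

Keeping the choice behind a single lookup like this means switching parsers for a crawl of badly broken HTML would only need a config change, not a rebuild.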