[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ]
Andrzej Bialecki updated NUTCH-54:
-----------------------------------
Attachment: parsestatus.patch
* the HTML "meta" tag processor has been extended so that it collects and
processes all meta tags. Convenience methods have been added to handle the
refresh meta tag, so that Fetcher can support multiple redirects.
* the interaction between content parsers and their users has been changed
from exception-driven to status-driven. This gives much better control over
the logic flow, and lets us communicate more information than just a
plaintext message. I plan to make similar changes for protocol handlers, which
should greatly simplify the logic in Fetcher.
* preliminary changes to Fetcher to support an automatic redirection loop when
parsers report a "refresh" meta directive.
* scaffolding to support parsing complete pages (i.e. pages fetched together
with all their elements, such as JavaScript and CSS).
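To illustrate the status-driven flow and the refresh loop described above,
here is a rough, self-contained sketch. It is not taken from
parsestatus.patch: the names ParseStatusSketch, Code, Status, parse and
follow are hypothetical, the "fetch" is just a lookup in an in-memory map,
and the regular expression only handles the simplest form of the refresh
meta tag.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; class and member names are assumptions, not the patch's API.
final class ParseStatusSketch {

    // Hypothetical status codes a parser could report instead of throwing.
    enum Code { SUCCESS, REFRESH, FAILED }

    // A status carries a code plus an optional argument, so it can convey
    // more than a plain-text message (here: the redirect target or a reason).
    static final class Status {
        final Code code;
        final String arg;

        Status(Code code, String arg) {
            this.code = code;
            this.arg = arg;
        }
    }

    // Simplified matcher for <meta http-equiv="refresh" content="N; url=...">.
    private static final Pattern META_REFRESH = Pattern.compile(
        "<meta\\s+http-equiv=[\"']?refresh[\"']?\\s+content=[\"']\\s*\\d+\\s*;\\s*url=([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns a status rather than throwing an exception.
    static Status parse(String html) {
        if (html == null) {
            return new Status(Code.FAILED, "no content");
        }
        Matcher m = META_REFRESH.matcher(html);
        if (m.find()) {
            return new Status(Code.REFRESH, m.group(1).trim());
        }
        return new Status(Code.SUCCESS, null);
    }

    // Follows refresh directives, allowing at most maxRedirects redirects.
    // Returns the final URL, or null on failure or when the limit is exceeded.
    static String follow(Map<String, String> pages, String url, int maxRedirects) {
        for (int hops = 0; hops <= maxRedirects; hops++) {
            Status status = parse(pages.get(url)); // stand-in for a real fetch
            switch (status.code) {
                case SUCCESS:
                    return url;
                case REFRESH:
                    url = status.arg; // loop again with the new target
                    break;
                default:
                    return null;
            }
        }
        return null; // too many redirects, possibly a redirect cycle
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/a",
            "<meta http-equiv=\"refresh\" content=\"0; url=http://example.com/b\">");
        pages.put("http://example.com/b", "<html><body>done</body></html>");
        System.out.println(follow(pages, "http://example.com/a", 3));
    }
}
```

The caller branches on the status code instead of catching exceptions, and
the hop limit keeps a page that refreshes to itself from looping forever.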
Any comments and suggestions are welcome!
> Fetcher improvements
> ---------------------
>
> Key: NUTCH-54
> URL: http://issues.apache.org/jira/browse/NUTCH-54
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Attachments: parsestatus.patch
>
> Fetcher improvements.