[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ]

Andrzej Bialecki  updated NUTCH-54:
-----------------------------------

    Attachment: parsestatus.patch

* The HTML "meta" tag processor has been extended so that it collects and 
processes all meta tags. Convenience methods have been added to handle the 
refresh meta tag, so that Fetcher can support multiple redirects.

* the interaction between content parsers and their users has been changed 
from exception-driven to status-driven. This gives much better control over 
the logic flow, and lets us communicate more information than just a 
plaintext message. I plan to make similar changes for protocol handlers, which 
should greatly simplify the logic in Fetcher.

* preliminary changes to Fetcher to support an automatic redirection loop when 
parsers report a "refresh" meta directive.

* scaffolding to support parsing complete pages (i.e. pages fetched together 
with all their elements, such as JavaScript and CSS).
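
For illustration only, here is a minimal sketch of how a status-driven parser 
result and a bounded refresh loop might fit together. It is not taken from 
parsestatus.patch: the names StatusDrivenSketch, Code, Status, resolve and 
maxRedirects are hypothetical, and the parser is stubbed as a plain function 
from URL to status.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class StatusDrivenSketch {

    /** Hypothetical outcome codes; the actual patch may define different ones. */
    enum Code { SUCCESS, REFRESH, FAILED }

    /** Hypothetical status object: an outcome code plus one optional argument. */
    static final class Status {
        final Code code;
        final String arg; // refresh target for REFRESH, message for FAILED, else null

        Status(Code code, String arg) {
            this.code = code;
            this.arg = arg;
        }
    }

    /**
     * Follows REFRESH statuses reported by the parser, without exceptions.
     * Returns the URL that finally parsed successfully, or null if a parse
     * failed or more than maxRedirects refreshes would have to be followed.
     */
    static String resolve(String url, Function<String, Status> parser, int maxRedirects) {
        String current = url;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            Status status = parser.apply(current);
            switch (status.code) {
                case SUCCESS:
                    return current;
                case REFRESH:
                    current = status.arg; // follow the refresh target
                    break;                // leaves the switch, not the loop
                default:
                    return null;          // FAILED: give up on this URL
            }
        }
        return null; // redirect limit reached, e.g. a refresh cycle
    }

    public static void main(String[] args) {
        // Stub "parser": a lookup table standing in for real content parsing.
        Map<String, Status> pages = new HashMap<>();
        pages.put("http://example.com/a", new Status(Code.REFRESH, "http://example.com/b"));
        pages.put("http://example.com/b", new Status(Code.SUCCESS, null));

        Function<String, Status> parser =
            u -> pages.getOrDefault(u, new Status(Code.FAILED, "unknown page"));

        System.out.println(resolve("http://example.com/a", parser, 3));
    }
}
```

The point of the sketch is that the caller branches on a status code rather 
than catching exceptions, and that the hop limit keeps a page that refreshes 
to itself from looping forever.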

Any comments and suggestions are welcome!

> Fetcher  improvements
> ---------------------
>
>          Key: NUTCH-54
>          URL: http://issues.apache.org/jira/browse/NUTCH-54
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 
>  Attachments: parsestatus.patch
>
> Fetcher improvements.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
