[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-828:
--------------------------------
Fix Version/s: (was: 1.5)
               (was: nutchgora)
               1.6
20120304-push-1.6
> Fetch Filter
> ------------
>
> Key: NUTCH-828
> URL: https://issues.apache.org/jira/browse/NUTCH-828
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.6
>
> Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch
>
>
> Adds a Nutch extension point for a fetch filter. The fetch filter allows
> filtering content and parse data/text after it is fetched but before it is
> written to segments. The filter returns true if content is to be written
> or false if it is not.
> Some use cases for this filter would be topical search engines that only want
> to fetch/index certain types of content, for example a news or sports only
> search engine. In these types of situations the only way to determine if
> content belongs to a particular set is to fetch the page and then analyze the
> content. If the content passes, meaning it belongs to the set of, say, sports
> pages, then we want to include it. If it doesn't, then we want to ignore it,
> never fetch that same page in the future, and ignore any urls on that page.
> If content is rejected due to a fetch filter then its status is written to
> the CrawlDb as gone and its content is ignored and not written to segments.
> This effectively stops crawling along the crawl path of that page and the urls
> from that page. An example filter, fetch-safe, is provided that only allows
> fetching content that contains none of the words in a list of bad words.
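
The behaviour described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the code in the attached
patches: the names FetchFilter, BadWordFilter, Status and process are
hypothetical, and the real extension point works with Nutch content and
parse objects rather than plain strings. The sketch shows a filter returning
true or false, and a rejected page being marked gone instead of being
written to the segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FetchFilterSketch {

    /** Hypothetical status values standing in for the CrawlDb states. */
    enum Status { FETCHED, GONE }

    /** Hypothetical shape of the extension point (an assumption, not the patch's API). */
    interface FetchFilter {
        /** Returns true if the fetched page should be written to segments. */
        boolean accept(String url, String parseText);
    }

    /** Hypothetical stand-in for fetch-safe: rejects text containing any listed word. */
    static final class BadWordFilter implements FetchFilter {
        private final List<String> badWords = new ArrayList<>();

        BadWordFilter(List<String> words) {
            for (String w : words) {
                badWords.add(w.toLowerCase(Locale.ROOT));
            }
        }

        @Override
        public boolean accept(String url, String parseText) {
            // Plain case-insensitive substring match; a real filter would likely tokenize.
            String text = parseText.toLowerCase(Locale.ROOT);
            for (String w : badWords) {
                if (text.contains(w)) {
                    return false;
                }
            }
            return true;
        }
    }

    /**
     * Applies the filter after fetching. Accepted pages are added to the
     * segment (modelled here as a list of urls); rejected pages are not
     * written and are reported as GONE.
     */
    static Status process(FetchFilter filter, String url, String parseText, List<String> segment) {
        if (filter.accept(url, parseText)) {
            segment.add(url);
            return Status.FETCHED;
        }
        return Status.GONE;
    }

    public static void main(String[] args) {
        FetchFilter filter = new BadWordFilter(Arrays.asList("badword"));
        List<String> segment = new ArrayList<>();
        System.out.println(process(filter, "http://example.com/sports", "Match report", segment));
        System.out.println(process(filter, "http://example.com/other", "Some BadWord here", segment));
        System.out.println("segment: " + segment);
    }
}
```

In this sketch the caller, not the filter, decides what a rejection means
(mark as gone, skip the outlinks), which keeps the filter itself a simple
yes/no check.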