Thamme Gowda N created NUTCH-2144:
-------------------------------------
Summary: Plugin to override db.ignore.external to exempt
interesting external domain URLs
Key: NUTCH-2144
URL: https://issues.apache.org/jira/browse/NUTCH-2144
Project: Nutch
Issue Type: New Feature
Components: crawldb, fetcher
Reporter: Thamme Gowda N
Priority: Minor
Create a rule-based urlfilter plugin that allows a focused crawler
(db.ignore.external.links=true) to fetch static resources from external domains.
More generally, this plugin should permit interesting URLs from external
domains by overriding db.ignore.external.links. Whether a URL is interesting
is decided by a combination of regex and MIME-type rules.
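
A minimal sketch of how such combined rules might be evaluated is given
below. It is an illustration only and does not use the Nutch plugin API: the
class ExternalLinkExemption, the nested Rule type, and the extension-to-MIME
table are all hypothetical names introduced here, and guessing the MIME type
from the file extension is an assumption made to keep the example
self-contained.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the Nutch URLFilter API, only the rule logic.
final class ExternalLinkExemption {

    /** One rule: a URL regex plus a MIME-type regex; both must match. */
    static final class Rule {
        final Pattern urlPattern;
        final Pattern mimePattern;

        Rule(Pattern urlPattern, Pattern mimePattern) {
            this.urlPattern = urlPattern;
            this.mimePattern = mimePattern;
        }

        static Rule of(String urlRegex, String mimeRegex) {
            return new Rule(Pattern.compile(urlRegex), Pattern.compile(mimeRegex));
        }
    }

    // Assumed lookup table; the MIME type is guessed from the extension only.
    private static final Map<String, String> MIME_BY_EXTENSION = Map.of(
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "png", "image/png",
            "gif", "image/gif",
            "css", "text/css",
            "js", "application/javascript");

    private final List<Rule> rules;

    ExternalLinkExemption(List<Rule> rules) {
        this.rules = List.copyOf(rules);
    }

    /** Returns the guessed MIME type, or "" when the extension is unknown. */
    static String guessMimeType(String url) {
        String path = url;
        int cut = path.indexOf('?');          // drop the query string
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');              // drop the fragment
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot < 0 || dot < slash) {         // no extension in the last segment
            return "";
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.getOrDefault(ext, "");
    }

    /** True if any rule matches both the URL and its guessed MIME type. */
    boolean isExempted(String url) {
        String mime = guessMimeType(url);
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).find()
                    && rule.mimePattern.matcher(mime).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With a single rule such as Rule.of("^https?://", "image/.*"), an image URL on
any external host would be exempted while an external HTML page would still
be ignored, which avoids writing one regular expression per domain.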
Concrete use case:
When using Nutch to crawl images from a set of domains, the crawler needs to
fetch all images, which may be linked from CDNs and other domains. In this
scenario, allowing all external links and then writing hundreds of regular
expressions is not feasible for a large number of domains.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)