[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146685#comment-15146685 ]
Sebastian Nagel commented on NUTCH-2144:
----------------------------------------
Hi [~thammegowda],
thanks! Everything looks good with the changes. It's definitely a good idea to
reuse the code from urlfilter-regex, and users will appreciate it if rules/regexes
work the same way. The ant build files are ok, afaics, but I'll try to test the
plugin tomorrow.
There are two points I would like to bring up for discussion now, since this
plugin will introduce a new interface, and interfaces aren't easily changed later:
# currently the filter(...) method takes fromUrl and toUrl as arguments. The
interface could be more powerful and adaptable to further use cases if we add
## the tag name where the link comes from ("a", "img", "form", etc.). Currently
the tag name is not available in ParseOutputFormat, we would have to pass it
via Outlink from the parser where tag names are already used to filter links,
cf. property "parser.html.outlinks.ignore_tags". Tag names would be the easier
way to distinguish between page resources and real outlinks.
## similarly, whether it's a link or a redirect: this could be used to follow
redirects when a site has moved to a different host and is now redirected, while
still ignoring external outlinks
# the naming could be more explicit: "URLExemptionFilter" or
"urlfilter-ignoreexempt" do not make clear that it's about an exemption from the
"db.ignore.external.links" property. Only the config file
"conf/db-ignore-external-exemptions.txt" is sufficiently precise. To avoid
overlong names (e.g., "IgnoreExternalLinksExemptionUrlFilter"), maybe resolve
the double negation to something like "AcceptExternalUrlFilter" or
"urlfilter-externallink".
As said, both points are just for discussion, or for later improvements.
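To make the first point more concrete, here is a rough sketch of what an
extended interface might look like. This is for discussion only and not taken
from the patch: the names ExternalLinkExemptionFilter, LinkType and
ResourceAndRedirectExemption, the extra parameters and the boolean return type
are all assumptions.

```java
/** Hypothetical: distinguishes ordinary outlinks from redirects. */
enum LinkType { OUTLINK, REDIRECT }

/** Hypothetical extended filter interface; not the one in the patch. */
interface ExternalLinkExemptionFilter {
  /**
   * @param fromUrl URL of the page containing the link
   * @param toUrl   external URL the link or redirect points to
   * @param tagName HTML tag the link came from ("a", "img", ...), may be null
   * @param type    whether this is an outlink or a redirect
   * @return true if toUrl should be exempted from db.ignore.external.links
   */
  boolean filter(String fromUrl, String toUrl, String tagName, LinkType type);
}

/** Example rule set: follow all redirects, accept resources embedded via img. */
class ResourceAndRedirectExemption implements ExternalLinkExemptionFilter {
  @Override
  public boolean filter(String fromUrl, String toUrl, String tagName,
      LinkType type) {
    if (type == LinkType.REDIRECT) {
      return true; // site moved to another host: keep following it
    }
    // Page resources are exempted, real outlinks ("a", "form", ...) are not.
    return "img".equalsIgnoreCase(tagName);
  }
}
```

With such a signature, a rule like "exempt images and redirects but nothing
else" needs no URL regexes at all, which is the main reason to consider the
extra arguments before the interface is released.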
> Plugin to override db.ignore.external to exempt interesting external domain
> URLs
> --------------------------------------------------------------------------------
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Reporter: Thamme Gowda N
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: This plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external). The interesting
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for a large number of domains.