[
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171366#comment-15171366
]
ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------
GitHub user thammegowda opened a pull request:
https://github.com/apache/nutch/pull/93
NUTCH-2144 Added an extension point and a plugin to accept external links
This PR is a duplicate of #89.
It was recreated because of issues caused by the move to writable git.
@chrismattmann
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/thammegowda/nutch NUTCH-2144
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/93.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #93
----
commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda <[email protected]>
Date: 2016-02-29T03:23:26Z
NUTCH-2144 Added an extension point and a plugin that overrides
db.ignore.external to accept external links
commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda <[email protected]>
Date: 2016-02-29T03:29:09Z
Add a sample config
----
> Plugin to override db.ignore.external to exempt interesting external domain
> URLs
> --------------------------------------------------------------------------------
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Reporter: Thamme Gowda N
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: the plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external.links). The
> interesting URLs are decided by a combination of regex and MIME-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images, which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for a large number of domains.
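
The exemption logic described in the issue can be illustrated with a minimal sketch. This is not the code from the pull request and does not use Nutch's plugin API: the class name ExternalLinkExemptionSketch, the Rule type, and the shouldFollow method are all hypothetical, and the sketch assumes that "external" means "different host" and that the MIME type of the target is already known to the caller.

```java
import java.net.URI;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the NUTCH-2144 plugin and not Nutch API.
final class ExternalLinkExemptionSketch {

    // One exemption rule: a URL regex combined with a MIME-type prefix.
    static final class Rule {
        final Pattern urlPattern;
        final String mimePrefix;

        Rule(String urlRegex, String mimePrefix) {
            this.urlPattern = Pattern.compile(urlRegex);
            this.mimePrefix = mimePrefix;
        }

        // A rule matches only when both the URL and the MIME type match.
        boolean matches(String url, String mimeType) {
            return mimeType != null
                && mimeType.startsWith(mimePrefix)
                && urlPattern.matcher(url).find();
        }
    }

    // Assumption: a link is internal when source and target share a host.
    static boolean sameHost(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from != null && from.equalsIgnoreCase(to);
    }

    // Decides whether a link should be kept. The ignoreExternal flag stands
    // in for db.ignore.external.links; the rules exempt selected external URLs.
    static boolean shouldFollow(String fromUrl, String toUrl, String mimeType,
                                boolean ignoreExternal, List<Rule> rules) {
        if (!ignoreExternal || sameHost(fromUrl, toUrl)) {
            return true; // nothing to override for internal links
        }
        for (Rule rule : rules) {
            if (rule.matches(toUrl, mimeType)) {
                return true; // external, but exempted by a rule
            }
        }
        return false; // external and not exempted
    }
}
```

With a single rule such as new Rule("\\.(jpe?g|png|gif)$", "image/"), an image hosted on a CDN would be kept while an external HTML page would still be dropped, which matches the image-crawling use case above without listing every external domain.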