[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171366#comment-15171366
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---------------------------------------

GitHub user thammegowda opened a pull request:

    https://github.com/apache/nutch/pull/93

    NUTCH-2144 Added an extension point and a plugin to accept external links

    This PR is a duplicate of #89.
    Recreated due to issues caused while moving to writable git.
    
    
    @chrismattmann 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #93
    
----
commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda <[email protected]>
Date:   2016-02-29T03:23:26Z

    NUTCH-2144 Added an extension point and a plugin that overrides db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda <[email protected]>
Date:   2016-02-29T03:29:09Z

    Add a sample config

----


> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: the plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external.links). The 
> interesting URLs are decided by a combination of regex and MIME-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images, which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.
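
The rule described above can be illustrated with a small standalone sketch. It is not the code from the pull request and does not use any Nutch API: the class name `ExternalLinkExemption`, the method `accept`, and the example image pattern are all hypothetical. It covers only the regex half of the idea (MIME-type rules are left out), assuming that a link is kept when it stays on the same host or when the target URL matches one of the exemption patterns.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of an "exempt some external links" rule.
 * Not the plugin's actual implementation and not tied to Nutch classes.
 */
class ExternalLinkExemption {

    // Regex rules that mark an external URL as interesting (assumed format).
    private final List<Pattern> exemptPatterns;

    ExternalLinkExemption(List<Pattern> exemptPatterns) {
        this.exemptPatterns = exemptPatterns;
    }

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /**
     * Decides whether the link fromUrl -> toUrl is kept while external
     * links are otherwise ignored.
     */
    boolean accept(String fromUrl, String toUrl) {
        String fromHost = hostOf(fromUrl);
        String toHost = hostOf(toUrl);
        if (fromHost == null || toHost == null) {
            return false; // unparsable URLs are dropped
        }
        if (fromHost.equals(toHost)) {
            return true; // internal link, always kept
        }
        // External link: keep it only if some exemption rule matches.
        for (Pattern rule : exemptPatterns) {
            if (rule.matcher(toUrl).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With a single rule such as `(?i)\.(jpe?g|png|gif)$`, a link from a crawled page to an image on a CDN host would be kept, while a link to an HTML page on another domain would still be ignored. This is the point of the use case: a handful of rules replaces per-domain whitelisting.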



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
