[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

Lewis John McGibbney (JIRA) Wed, 10 Feb 2016 09:20:09 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141213#comment-15141213
 ]


Lewis John McGibbney commented on NUTCH-2144:
---------------------------------------------

Hi [~thammegowda], the limitations I see are as follows:
 * As mentioned, the HEAD request is going to slow things down. I see your 
FIXME. I have a suggestion for the time being: let's initially address the 
case where we don't bother with HEAD and just rely upon mimeType detection 
through evaluation of the URL suffix. What do you think about this?
 * I feel that this plugin could be extended to also deal with 
db.ignore.internal. The same reasoning applies to the use case where we wish 
to crawl images from a set of domains: the crawler needs to fetch all images 
that are linked internally, but I have a list of, say, 5000 such domains. In 
this scenario, allowing all internal links and then writing hundreds of 
regular expressions is not feasible for a large number of domains.
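
To make the suffix-based suggestion concrete, here is a minimal sketch of guessing a MIME type from the URL suffix alone, with no HEAD request. The class name SuffixMimeGuesser, its methods and the small suffix table are hypothetical and are not taken from the attached patch; a real plugin would more likely load its rules from configuration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical helper (not part of the patch): guesses a MIME type from
 * the file suffix of a URL path, without issuing any network request.
 */
public class SuffixMimeGuesser {

    // Assumed, deliberately small suffix table for illustration only.
    private static final Map<String, String> SUFFIX_TO_MIME = Map.of(
        "jpg", "image/jpeg",
        "jpeg", "image/jpeg",
        "png", "image/png",
        "gif", "image/gif",
        "css", "text/css",
        "js", "application/javascript");

    /** Returns the guessed MIME type, or null if the suffix is unknown. */
    public static String guess(String url) {
        String path;
        try {
            // getPath() drops the query string and fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: no guess
        }
        if (path == null) {
            return null; // opaque URI such as mailto:
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null; // no suffix present
        }
        String suffix = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUFFIX_TO_MIME.get(suffix);
    }

    /** True if the URL suffix maps to an image/* type. */
    public static boolean isImage(String url) {
        String mime = guess(url);
        return mime != null && mime.startsWith("image/");
    }
}
```

The trade-off is that URLs without a recognizable suffix (for example, images served from a script endpoint) get no guess and would not be exempted, which seems acceptable as a first step while the HEAD approach remains a FIXME.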

This is a nice patch and a lot of work. I like the extension point.

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: the plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> URLs are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
