[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

Sebastian Nagel (JIRA) Sun, 14 Feb 2016 10:43:50 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146685#comment-15146685
 ]


Sebastian Nagel commented on NUTCH-2144:
----------------------------------------

Hi [~thammegowda],
thanks! Everything looks good with the changes. It's definitely a good idea to 
reuse the code from urlfilter-regex, and users will appreciate it if 
rules/regexes work the same way. The ant build files look ok, afaics, but I'll 
try to test the plugin tomorrow.

Two points I would like to bring up for discussion now, since this plugin will 
introduce a new interface, and interfaces aren't easily changed later:
# currently the filter(...) method takes fromUrl and toUrl as arguments. The 
interface could be more powerful and adaptable to further use cases if we add
## the tag name where the link comes from ("a", "img", "form", etc.). Currently 
the tag name is not available in ParseOutputFormat, we would have to pass it 
via Outlink from the parser where tag names are already used to filter links, 
cf. property "parser.html.outlinks.ignore_tags". Tag names would be the easier 
way to distinguish between page resources and real outlinks.
## similarly, whether it's a link or a redirect: this could be used to follow 
redirects when a site has moved to a different host and is now redirected, 
while still ignoring external outlinks
# the naming could be more explicit: "URLExemptionFilter" or 
"urlfilter-ignoreexempt" do not make clear that it's about an exemption from 
the "db.ignore.external.links" property. Only the config file 
"conf/db-ignore-external-exemptions.txt" is sufficiently precise. To avoid 
overlong names (e.g., "IgnoreExternalLinksExemptionUrlFilter"), maybe resolve 
the double negation to something like "AcceptExternalUrlFilter" or 
"urlfilter-externallink".

As said, both points are just for discussion, or for later improvements.
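
To make the first point a bit more concrete, here is a rough sketch of what an 
extended interface could look like. It is not the interface from the patch: 
the names ExternalLinkExemption, LinkKind, isExempted and 
ResourceOrRedirectExemption are made up for illustration, and the image regex 
is only an example rule. It assumes the tag name and the link/redirect 
distinction were somehow made available to the caller.

```java
import java.util.regex.Pattern;

public class ExemptionSketch {

    /** Hypothetical: whether the target URL was found as an outlink or a redirect. */
    enum LinkKind { LINK, REDIRECT }

    /** Hypothetical extended interface; NOT the interface introduced by the patch. */
    interface ExternalLinkExemption {
        /**
         * @param tagName HTML tag the link came from ("a", "img", ...), may be null
         * @return true if the external toUrl should be kept despite
         *         db.ignore.external.links=true
         */
        boolean isExempted(String fromUrl, String toUrl, String tagName, LinkKind kind);
    }

    /** Example rule: keep external redirects and images embedded via an img tag. */
    static class ResourceOrRedirectExemption implements ExternalLinkExemption {
        private final Pattern imagePattern =
            Pattern.compile(".*\\.(?:png|jpe?g|gif)$", Pattern.CASE_INSENSITIVE);

        @Override
        public boolean isExempted(String fromUrl, String toUrl, String tagName, LinkKind kind) {
            if (kind == LinkKind.REDIRECT) {
                return true; // follow a site that has moved to another host
            }
            // page resources are recognized by tag name plus a URL pattern
            return "img".equalsIgnoreCase(tagName) && imagePattern.matcher(toUrl).matches();
        }
    }

    public static void main(String[] args) {
        ExternalLinkExemption f = new ResourceOrRedirectExemption();
        String from = "http://example.org/index.html";
        // embedded image on another host: exempted
        System.out.println(f.isExempted(from, "http://cdn.example.net/logo.png", "img", LinkKind.LINK));
        // ordinary external outlink: still ignored
        System.out.println(f.isExempted(from, "http://other.example.com/page.html", "a", LinkKind.LINK));
        // redirect to a different host: exempted
        System.out.println(f.isExempted(from, "http://new.example.net/", null, LinkKind.REDIRECT));
    }
}
```

With only fromUrl and toUrl, the second and third cases could not be told 
apart from the first without guessing from the URL string alone.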

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2144
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2144
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb, fetcher
>            Reporter: Thamme Gowda N
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.12
>
>         Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
