[
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964175#comment-14964175
]
Thamme Gowda N commented on NUTCH-2144:
---------------------------------------
Thanks for your feedback. I agree that the content type will be missing for
newly discovered URLs. I am also well aware that HTTP HEAD requests add a lot
of overhead.
To limit that cost, I apply the regex filter first, so the HEAD call is made
only for URLs matched by the regex. That was still not sufficient, so the
content type filter is optional.
For example,
{color:red}
image/jpeg,image/png,image/gif=.*\.(jpg|JPG|png$|PNG|gif|GIF)$
{color}
the above rule makes an HTTP HEAD call to each URL matched by the regex to
determine its content type.
However,
{color:red}
=.*\.(jpg|JPG|png$|PNG|gif|GIF)$
{color}
rule doesn't make an HTTP HEAD call, because no content type filter is applied.
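To make the evaluation order concrete, here is a minimal sketch of how such a
rule could be parsed and applied. It is an illustration only, not the code in
the attached patch: the class name ExemptionRule, its methods, and the injected
content type lookup are all hypothetical. It assumes the rule format shown
above (comma-separated MIME types, then '=', then a regex) and that an empty
left-hand side means "skip the content type check".

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Pattern;

/** Hypothetical sketch of a "mimeTypes=regex" exemption rule (not the patch's code). */
class ExemptionRule {
  final Set<String> mimeTypes;  // empty set: content type filter is disabled
  final Pattern urlPattern;

  ExemptionRule(Set<String> mimeTypes, Pattern urlPattern) {
    this.mimeTypes = mimeTypes;
    this.urlPattern = urlPattern;
  }

  /** Parses "type1,type2=regex"; the part before the first '=' may be empty. */
  static ExemptionRule parse(String line) {
    int eq = line.indexOf('=');
    if (eq < 0) {
      throw new IllegalArgumentException("Missing '=' in rule: " + line);
    }
    Set<String> types = new HashSet<>();
    for (String t : line.substring(0, eq).split(",")) {
      String trimmed = t.trim().toLowerCase(Locale.ROOT);
      if (!trimmed.isEmpty()) {
        types.add(trimmed);
      }
    }
    return new ExemptionRule(types, Pattern.compile(line.substring(eq + 1)));
  }

  /**
   * Runs the cheap regex check first and consults the content type lookup
   * (for example an HTTP HEAD request) only when the rule lists MIME types.
   */
  boolean isExempt(String url, Function<String, String> contentTypeLookup) {
    if (!urlPattern.matcher(url).matches()) {
      return false;               // regex miss: no lookup is made
    }
    if (mimeTypes.isEmpty()) {
      return true;                // no content type filter: no lookup is made
    }
    String type = contentTypeLookup.apply(url);
    if (type == null) {
      return false;               // content type unknown: do not exempt
    }
    int semi = type.indexOf(';'); // drop parameters such as "; charset=..."
    if (semi >= 0) {
      type = type.substring(0, semi);
    }
    return mimeTypes.contains(type.trim().toLowerCase(Locale.ROOT));
  }

  /** One possible lookup: an HTTP HEAD request; returns null on any failure. */
  static String headContentType(String url) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) URI.create(url).toURL().openConnection();
      conn.setRequestMethod("HEAD");
      try {
        return conn.getContentType();
      } finally {
        conn.disconnect();
      }
    } catch (Exception e) {
      return null;
    }
  }
}
```

With this shape, a caller would pass ExemptionRule::headContentType as the
lookup in a real crawl. Passing the lookup as a parameter also lets a cached or
stubbed lookup be swapped in, which keeps the overhead concern above testable
without network access.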
> Plugin to override db.ignore.external to exempt interesting external domain
> URLs
> --------------------------------------------------------------------------------
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Reporter: Thamme Gowda N
> Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: This plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external). The interesting
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for large number of domains.