[ https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013545#comment-15013545 ]
Julien Nioche commented on NUTCH-2069:
--------------------------------------
> I propose the modes be named just 'host' and 'domain', as they are
> elsewhere.
Not really, see fetcher.queue.mode and partition.url.mode
[https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L723]
This issue is not about fixing existing discrepancies; those should be
addressed separately.
As for mixing bydomain and byDomain, we do that only when comparing the strings:
{code}
if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode))
{code}
Since the comparison is case-insensitive, changing to "byDomain" won't make any
difference, but feel free to change it if you feel strongly about it.
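
To illustrate why the casing of the literal does not matter, here is a minimal standalone sketch. The class name IgnoreModeDemo and the helper isByDomain are hypothetical and not part of Nutch or of the patch; only the equalsIgnoreCase comparison is taken from the snippet above.

```java
public class IgnoreModeDemo {

    // Hypothetical helper (not Nutch code): wraps the comparison quoted above.
    // equalsIgnoreCase returns false for a null argument, so an unset mode
    // is simply treated as "not bydomain".
    static boolean isByDomain(String ignoreExternalLinksMode) {
        return "bydomain".equalsIgnoreCase(ignoreExternalLinksMode);
    }

    public static void main(String[] args) {
        // Any casing of "bydomain" matches; other values and null do not.
        String[] modes = {"bydomain", "byDomain", "BYDOMAIN", "byHost", null};
        for (String mode : modes) {
            System.out.println(mode + " -> " + isByDomain(mode));
        }
    }
}
```

Placing the literal on the left-hand side also keeps the check null-safe, which may matter if the property is not set.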
> Ignore external links based on domain
> -------------------------------------
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher, parser
> Affects Versions: 1.10
> Reporter: Julien Nioche
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of
> restricting the crawl based on the hostname. This adds a new parameter
> 'db.ignore.external.links.domain' to do the same based on the domain.