[
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646793#comment-14646793
]
Sebastian Nagel commented on NUTCH-2069:
----------------------------------------
You're right, Julien. The code in FetcherThread does not follow the style. The
code formatting patch (NUTCH-865) is now 80 issues back in the history, Fetcher
has been refactored meanwhile, and not all commits are following the style.
It's often hard to resist :) and not correct the style, re-organize imports,
etc., so that patches stay lean and easy to review. But back to the main topic:
+1, so far. One point: 'db.ignore.external.links' and the new
'db.ignore.external.links.domain' are mutually exclusive, "external" is either
defined by host or domain. This should show up in the code:
{code}
if (ignoreExternalLinks) { ... } else if (ignoreLinksOutsideDomain) { ... }
{code}
Or we could define this as two properties `db.ignore.external.links` +
`db.ignore.external.links.mode`. The latter can be "host" or "domain", similar
to other properties (partition.url.mode, generator.count.mode,
fetcher.queue.mode). That would be extensible and can make the code leaner.
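A rough sketch of how the two-property variant might look, not taken from the attached patch. The class `ExternalLinkCheck`, its methods, and the `Map`-based configuration are hypothetical stand-ins (the real code would read a Hadoop `Configuration`), and `naiveDomain` only keeps the last two host labels, whereas real code would need a public-suffix-aware lookup:

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not the code from NUTCH-2069.patch. */
final class ExternalLinkCheck {

  /** Assumed values of the proposed db.ignore.external.links.mode. */
  public enum Mode { HOST, DOMAIN }

  private final boolean ignoreExternalLinks;
  private final Mode mode;

  /** A plain Map stands in for the real configuration object. */
  public ExternalLinkCheck(Map<String, String> conf) {
    ignoreExternalLinks = Boolean.parseBoolean(
        conf.getOrDefault("db.ignore.external.links", "false"));
    // Assumption: the mode defaults to "host", i.e. the existing behaviour.
    mode = Mode.valueOf(conf
        .getOrDefault("db.ignore.external.links.mode", "host")
        .toUpperCase(Locale.ROOT));
  }

  /** Lower-cased host of a URL, or "" if the URL has none. */
  static String hostOf(String url) {
    String host = URI.create(url).getHost();
    return host == null ? "" : host.toLowerCase(Locale.ROOT);
  }

  /** Naive stand-in for a domain lookup: keeps the last two host labels. */
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  /** True if the outlink should be dropped as "external". */
  public boolean shouldIgnore(String fromUrl, String toUrl) {
    if (!ignoreExternalLinks) {
      return false;
    }
    String from = hostOf(fromUrl);
    String to = hostOf(toUrl);
    if (mode == Mode.DOMAIN) {
      from = naiveDomain(from);
      to = naiveDomain(to);
    }
    return !from.equals(to);
  }
}
```

With a single mode value the host and domain variants cannot both be active, and a further mode would only need one more enum constant.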
Btw., good idea to add the formatter to 1.x as well, and if possible
automatically add it to the Eclipse project created by "ant eclipse".
> Ignore external links based on domain
> -------------------------------------
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher, parser
> Affects Versions: 1.10
> Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of
> restricting the crawl based on the hostname. This adds a new parameter
> 'db.ignore.external.links.domain' to do the same based on the domain.