[
https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2220:
---------------------------------
Description:
We need an option db.ignore.internal.links that operates in FetcherThread, just
like db.ignore.external.links. An option with that name already exists, but it
is only used by the LinkDB and defaults to true, which is not a good default
for FetcherThread.
I propose to make a clear distinction between options that are used by the
LinkDB and those that are not. Most options used by the LinkDB already have the
right prefix, but db.ignore.*.links, db.max.inlinks and db.max.anchor.length do
not yet.
This patch will rename those options to linkdb.* prefixes so afterwards we can
implement db.ignore.internal.links that operates in FetcherThread, just like
db.ignore.external.links.
This will introduce a change in default parameters. Please comment.
h2. How to upgrade from earlier releases
* replace your old conf/nutch-default.xml with the conf/nutch-default.xml from
the Nutch 1.12 release
* if you use LinkDB (e.g. invertlinks) and have modified any of
{{db.max.inlinks}}, {{db.max.anchor.length}} or {{db.ignore.internal.links}},
rename them to {{linkdb.max.inlinks}}, {{linkdb.max.anchor.length}} and
{{linkdb.ignore.internal.links}} respectively
* {{db.ignore.internal.links}} and {{db.ignore.external.links}} now operate on
the CrawlDB only
* {{linkdb.ignore.internal.links}} and {{linkdb.ignore.external.links}} now
operate on the LinkDB only
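
The rename step in the upgrade notes could be scripted for a site-specific override file. The following is only a minimal sketch, assuming the overrides live in a Hadoop-style configuration document (such as conf/nutch-site.xml) with {{<property><name>...</name><value>...</value></property>}} entries. The function name {{rename_linkdb_properties}} is hypothetical and not part of Nutch. It also assumes that any existing {{db.ignore.internal.links}} override was meant for the LinkDB, which should be checked by hand.

```python
import xml.etree.ElementTree as ET

# Renames taken from the upgrade notes (old name -> new name).
RENAMES = {
    "db.max.inlinks": "linkdb.max.inlinks",
    "db.max.anchor.length": "linkdb.max.anchor.length",
    "db.ignore.internal.links": "linkdb.ignore.internal.links",
}


def rename_linkdb_properties(xml_text):
    """Hypothetical helper: rename LinkDB-only properties in a config document.

    Returns a tuple (new_xml_text, renamed_old_names). Properties not listed
    in RENAMES, such as db.ignore.external.links, are left untouched.
    """
    root = ET.fromstring(xml_text)
    renamed = []
    for prop in root.iter("property"):
        name = prop.find("name")
        if name is None or not name.text:
            continue
        old = name.text.strip()
        if old in RENAMES:
            name.text = RENAMES[old]
            renamed.append(old)
    return ET.tostring(root, encoding="unicode"), renamed


if __name__ == "__main__":
    # Illustrative input only; the values are made up for the example.
    sample = """<configuration>
  <property><name>db.max.inlinks</name><value>5000</value></property>
  <property><name>db.ignore.external.links</name><value>true</value></property>
</configuration>"""
    new_xml, renamed = rename_linkdb_properties(sample)
    print("renamed:", renamed)
    print(new_xml)
```

Note that the standard-library parser used here drops XML comments and the XML declaration, so the output is best reviewed and merged by hand rather than written straight over the original file.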
> Rename db.* options used only by the linkdb to linkdb.*
> -------------------------------------------------------
>
> Key: NUTCH-2220
> URL: https://issues.apache.org/jira/browse/NUTCH-2220
> Project: Nutch
> Issue Type: Task
> Components: linkdb
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2220.patch
>