[ 
https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2220:
---------------------------------
    Description: 
We need an option db.ignore.internal.links that operates in FetcherThread, just 
like db.ignore.external.links. It already exists, but it is only used by the 
LinkDB and defaults to true, which is not a good default for FetcherThread.

I propose to make a clear distinction between options that are used by the 
LinkDB and those that are not. Most options used by the LinkDB already have 
the right prefix, but db.ignore.*.links, db.max.inlinks and 
db.max.anchor.length do not yet.

This patch renames those options to use the linkdb.* prefix, so that afterwards 
we can implement a db.ignore.internal.links that operates in FetcherThread, 
just like db.ignore.external.links.

This will introduce a change in default parameters. Please comment.

h2. How to upgrade from earlier releases
* replace your old conf/nutch-default.xml with the conf/nutch-default.xml from 
the Nutch 1.12 release
* if you use the LinkDB (e.g. invertlinks) and have modified any of the 
parameters {{db.max.inlinks}}, {{db.max.anchor.length}} or 
{{db.ignore.internal.links}}, rename them to {{linkdb.max.inlinks}}, 
{{linkdb.max.anchor.length}} and {{linkdb.ignore.internal.links}} respectively
* {{db.ignore.internal.links}} and {{db.ignore.external.links}} now operate on 
the CrawlDB only
* {{linkdb.ignore.internal.links}} and {{linkdb.ignore.external.links}} now 
operate on the LinkDB only
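
As a rough illustration of the rename described in the list above, LinkDB 
overrides that previously used the db.* names might look like this afterwards. 
This is only a sketch: it assumes the overrides live in conf/nutch-site.xml in 
the usual Hadoop-style property format, and the values shown are placeholders, 
not recommendations.

```xml
<!-- conf/nutch-site.xml (assumed location); values are illustrative placeholders -->
<configuration>
  <!-- formerly db.max.inlinks -->
  <property>
    <name>linkdb.max.inlinks</name>
    <value>10000</value>
  </property>
  <!-- formerly db.max.anchor.length -->
  <property>
    <name>linkdb.max.anchor.length</name>
    <value>100</value>
  </property>
  <!-- formerly db.ignore.internal.links, as used by the LinkDB -->
  <property>
    <name>linkdb.ignore.internal.links</name>
    <value>true</value>
  </property>
</configuration>
```

Any {{db.ignore.internal.links}} or {{db.ignore.external.links}} entries left 
in the same file would then no longer affect the LinkDB.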

> Rename db.* options used only by the linkdb to linkdb.*
> -------------------------------------------------------
>
>                 Key: NUTCH-2220
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2220
>             Project: Nutch
>          Issue Type: Task
>          Components: linkdb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2220.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
