Hi Sebastian,

No problems.

Here it is,
https://issues.apache.org/jira/browse/NUTCH-2539

Semyon.
Sent: Monday, March 19, 2018 at 2:02 PM
From: "Sebastian Nagel" <wastl.na...@googlemail.com>
To: dev@nutch.apache.org
Subject: Re: Config issues with URL filters and normalizers in UpdateCrawlDb
Hi Semyon,

sorry for the late answer. Yes, you're right: the naming in nutch-default.xml is wrong.
Please open a Jira issue to address this.

The description should also mention that the property
crawldb.url.filters is "temporary", i.e. it is set/overwritten by command-line options.
Cf. the overview (somewhat outdated) on
https://wiki.apache.org/nutch/NutchPropertiesCompleteList

Best,
Sebastian

On 02/19/2018 02:24 PM, Semyon Semyonov wrote:
> Gents,
>
> To use URL filters and normalizers in CrawlDBUpdate, three config settings may be used:
>
> In CrawlDbFilter, lines 41-43:
> public static final String URL_FILTERING = "crawldb.url.filters";
> public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> public static final String URL_NORMALIZING_SCOPE = "crawldb.url.normalizers.scope";
>
>
> However, in nutch-default.xml we have different names:
> <property>
> <name>db.url.normalizers</name>
> <value>false</value>
> <description>Normalize urls when updating crawldb</description>
> </property>
>
> <property>
> <name>db.url.filters</name>
> <value>false</value>
> <description>Filter urls when updating crawldb</description>
> </property>
>
>
> Obviously, that is the reason why the URL normalizers/filters don't work.
>
> Should I change CrawlDbFilter code to
> public static final String URL_FILTERING = "db.url.filters";
> public static final String URL_NORMALIZING = "db.url.normalizers";
> public static final String URL_NORMALIZING_SCOPE = "db.url.normalizers.scope";
>
>
> ?
>
> Semyon.
>
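To illustrate the mismatch described in the quoted message, here is a minimal sketch. It uses java.util.Properties as a stand-in for the real configuration object, and the class name and the getBoolean helper are hypothetical, not Nutch code. Only the key "crawldb.url.filters" and the property name "db.url.filters" are taken from the thread. It shows that a value set under one name is never seen by a lookup under the other, so the default is returned.

```java
import java.util.Properties;

public class PropertyNameMismatch {
    // Key read by CrawlDbFilter, as quoted in the message above.
    static final String URL_FILTERING = "crawldb.url.filters";

    // Hypothetical helper (an assumption for this sketch): boolean lookup
    // that falls back to a default when the key is absent.
    static boolean getBoolean(Properties conf, String key, boolean defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // The user enables filtering under the name documented in nutch-default.xml.
        conf.setProperty("db.url.filters", "true");

        // The code looks the setting up under a different name, so the
        // default (false) is returned and the user's value is ignored.
        boolean filtering = getBoolean(conf, URL_FILTERING, false);
        System.out.println("filtering enabled: " + filtering);
    }
}
```

Under these assumptions the lookup yields false even though the user set "true", which matches the behaviour reported above. Note that, per Sebastian's reply, the crawldb.* value is additionally set/overwritten by command-line options, which this sketch does not model.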
 
