Hi Semyon,

Sorry for the late answer. Yes, you're right: the naming in nutch-default.xml
is wrong.
Please open a Jira issue to address this.

The description should also mention that the property
crawldb.url.filters is "temporary": it is set/overwritten by command-line
options.
Cf. the overview (somewhat outdated) on
  https://wiki.apache.org/nutch/NutchPropertiesCompleteList
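
To illustrate the mismatch, here is a minimal sketch. It assumes a plain Map
as a stand-in for the real configuration object, and it assumes that "-filter"
is the command-line option that sets the property. The helper names getBoolean
and applyArgs are hypothetical and are not Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigKeyMismatch {
    // Key read by the code, as quoted from CrawlDbFilter.
    static final String URL_FILTERING = "crawldb.url.filters";

    // Hypothetical stand-in for a boolean config lookup with a default value.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Hypothetical: a command-line flag sets/overwrites the temporary property.
    // The flag name "-filter" is an assumption made for this sketch.
    static void applyArgs(Map<String, String> conf, String[] args) {
        for (String arg : args) {
            if ("-filter".equals(arg)) {
                conf.put(URL_FILTERING, "true");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Name documented in nutch-default.xml; the code never reads this key.
        conf.put("db.url.filters", "true");

        // The lookup uses the other key, so it falls back to the default.
        System.out.println(getBoolean(conf, URL_FILTERING, false));

        // Only the command-line flag changes what the lookup returns.
        applyArgs(conf, new String[] {"-filter"});
        System.out.println(getBoolean(conf, URL_FILTERING, false));
    }
}
```

In this sketch, setting db.url.filters has no effect on the lookup; only the
flag switches filtering on. That is why the description should say so.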

Best,
Sebastian

On 02/19/2018 02:24 PM, Semyon Semyonov wrote:
> Gents,
> 
> To use URL filters and normalizers in CrawlDBUpdate, three config settings
> may be used:
>  
> In CrawlDbFilter, lines 41-43:
>   public static final String URL_FILTERING = "crawldb.url.filters";
>   public static final String URL_NORMALIZING = "crawldb.url.normalizers";
>   public static final String URL_NORMALIZING_SCOPE = "crawldb.url.normalizers.scope";
> 
> 
> However, in nutch-default we have different names 
> <property>
>     <name>db.url.normalizers</name>
>     <value>false</value>
>     <description>Normalize urls when updating crawldb</description>
> </property>
> 
> <property>
>     <name>db.url.filters</name>
>     <value>false</value>
>     <description>Filter urls when updating crawldb</description>
> </property>
> 
> 
> Obviously, that is the reason why URLNormalizers/Filters don't work.
> 
> Should I change CrawlDbFilter code to
>   public static final String URL_FILTERING = "db.url.filters";
>   public static final String URL_NORMALIZING = "db.url.normalizers";
>   public static final String URL_NORMALIZING_SCOPE = "db.url.normalizers.scope";
> 
> 
> ?
> 
> Semyon.
> 
