Gents,
To use URL filters and normalizers in CrawlDBUpdate, three config settings
may be used.
In CrawlDbFilter, lines 41-43:
public static final String URL_FILTERING = "crawldb.url.filters";
public static final String URL_NORMALIZING = "crawldb.url.normalizers";
public static final String URL_NORMALIZING_SCOPE =
"crawldb.url.normalizers.scope";
However, in nutch-default.xml the properties have different names:
<property>
  <name>db.url.normalizers</name>
  <value>false</value>
  <description>Normalize urls when updating crawldb</description>
</property>
<property>
  <name>db.url.filters</name>
  <value>false</value>
  <description>Filter urls when updating crawldb</description>
</property>
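Obviously, that is why the URL normalizers and filters don't work: setting
the db.url.* properties has no effect, because the code looks up the
crawldb.url.* names instead.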
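To illustrate the mismatch, here is a minimal sketch. It uses java.util.Properties as a stand-in for the real job configuration, and it assumes the lookup falls back to false when a key is missing; the class name KeyMismatchDemo and the helper isEnabled are hypothetical and not part of Nutch.

```java
import java.util.Properties;

// Hypothetical demo class, not part of Nutch.
class KeyMismatchDemo {
    // Name the code looks up (constant quoted from CrawlDbFilter).
    static final String CODE_KEY = "crawldb.url.filters";
    // Name defined in nutch-default.
    static final String CONFIG_KEY = "db.url.filters";

    // Stand-in for a boolean config lookup; assumes a default of false
    // when the key is absent.
    static boolean isEnabled(Properties conf, String key) {
        return Boolean.parseBoolean(conf.getProperty(key, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // The user enables filtering under the documented name.
        conf.setProperty(CONFIG_KEY, "true");

        // The code asks for a different name, so it sees the default.
        System.out.println(CODE_KEY + " -> " + isEnabled(conf, CODE_KEY));
        System.out.println(CONFIG_KEY + " -> " + isEnabled(conf, CONFIG_KEY));
    }
}
```

Under these assumptions the first lookup prints false and the second prints true, which matches the behaviour described: the setting is present, but it is never read under the name the code uses.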
Should I change the CrawlDbFilter code to the following?
public static final String URL_FILTERING = "db.url.filters";
public static final String URL_NORMALIZING = "db.url.normalizers";
public static final String URL_NORMALIZING_SCOPE = "db.url.normalizers.scope";
Semyon.