nutch-default.xml has outdated example for urlfilter.order
----------------------------------------------------------
Key: NUTCH-388
URL: http://issues.apache.org/jira/browse/NUTCH-388
Project: Nutch
Issue Type: Bug
Components: documentation, fetcher
Affects Versions: 0.8.1, 0.8
Reporter: Jared Dunne
Priority: Minor
The description of the urlfilter.order entry in nutch-default.xml is
misleading and outdated. Its example refers to
org.apache.nutch.net.RegexURLFilter and org.apache.nutch.net.PrefixURLFilter,
when it should refer to org.apache.nutch.urlfilter.regex.RegexURLFilter and
org.apache.nutch.urlfilter.prefix.PrefixURLFilter. The current entry reads:
<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
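A corrected entry might look like the sketch below. Only the example class
names change, using the package names given above; the rest of the wording is
copied unchanged from the current description, and this is a suggestion rather
than a committed fix.

```xml
<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
```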
We wanted to run the prefix filter before the regex filter, so we copied the
example from the description and reversed the order. Because the package names
in the example are incorrect, this did not work, and the following warning
appeared in our logs for each URL in our fetchlist:
2006-10-17 15:55:46,533 WARN crawl.Injector - Skipping http://bar.foo.com/:java.lang.NullPointerException
2006-10-17 15:55:46,533 WARN crawl.Injector - Skipping http://baz.foo.com/:java.lang.NullPointerException
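For reference, here is a minimal sketch of a prefix-before-regex override that
uses the corrected class names. It assumes the override is placed in
conf/nutch-site.xml and that both the urlfilter-prefix and urlfilter-regex
plugins are already enabled through plugin.includes; neither assumption is
confirmed by the logs above, so adjust for the local setup.

```xml
<!-- Assumed location: conf/nutch-site.xml, which overrides nutch-default.xml. -->
<!-- Assumes the urlfilter-prefix and urlfilter-regex plugins are enabled. -->
<property>
  <name>urlfilter.order</name>
  <!-- Space-separated class names; filters are applied left to right, -->
  <!-- so the prefix filter runs before the regex filter here. -->
  <value>org.apache.nutch.urlfilter.prefix.PrefixURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
</property>
```

Because all filters are AND'ed, this ordering should only affect performance,
not which URLs pass.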