nutch-default.xml has outdated example for urlfilter.order
----------------------------------------------------------

                 Key: NUTCH-388
                 URL: http://issues.apache.org/jira/browse/NUTCH-388
             Project: Nutch
          Issue Type: Bug
          Components: documentation, fetcher
    Affects Versions: 0.8.1, 0.8
            Reporter: Jared Dunne
            Priority: Minor


The description of the urlfilter.order entry in nutch-default.xml is 
outdated and misleading.  Its example refers to 
org.apache.nutch.net.RegexURLFilter and org.apache.nutch.net.PrefixURLFilter, 
when it should refer to org.apache.nutch.urlfilter.regex.RegexURLFilter and 
org.apache.nutch.urlfilter.prefix.PrefixURLFilter.

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

We wanted to run the prefix filter before the regex filter, so we copied the 
example from the description and reversed the order.  Because the package 
names in the example are incorrect, this did not work, and the following 
warning appeared in our logs for each of the URLs in our fetchlist:
2006-10-17 15:55:46,533 WARN  crawl.Injector - Skipping http://bar.foo.com/:java.lang.NullPointerException
2006-10-17 15:55:46,533 WARN  crawl.Injector - Skipping http://baz.foo.com/:java.lang.NullPointerException
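
For reference, a minimal sketch of the override that was intended, using the 
corrected class names given above with the prefix filter listed first.  
Placing it in conf/nutch-site.xml is an assumption about the local setup, and 
it presumes both filter plugins are enabled through the plugin include/exclude 
properties that the description mentions:

```xml
<!-- Sketch only: assumes this goes in conf/nutch-site.xml and that both
     URL filter plugins are enabled via the plugin include/exclude settings. -->
<property>
  <name>urlfilter.order</name>
  <!-- Prefix filter first, regex filter second; names are space-separated. -->
  <value>org.apache.nutch.urlfilter.prefix.PrefixURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
</property>
```

As the description notes, the filters are AND'ed, so the order should only 
affect performance and not which URLs pass.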


        
