Brian created NUTCH-1614:
----------------------------

             Summary: Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
                 Key: NUTCH-1614
                 URL: https://issues.apache.org/jira/browse/NUTCH-1614
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 2.2.1
            Reporter: Brian
            Priority: Minor


We need to crawl some pages (such as certain main pages and different views 
of a main page) to reach all the other pages, but we don't want to index those 
pages themselves.  The URL filter approach therefore cannot be used, because a 
URL filter would also keep those pages from being crawled.

This plugin uses a file containing regex strings (see included sample file).  
If one of the regex strings matches an entire URL, that URL will be excluded 
from indexing.
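
To illustrate the matching rule described above, here is a minimal standalone 
sketch.  It is not the plugin code and does not use any Nutch API: the class 
name IndexExcludeSketch, its methods, and the example.com patterns are all 
hypothetical.  It assumes one regex per line, skips blank lines and lines 
starting with "#", and uses Matcher.matches() so that a regex must match the 
entire URL rather than a substring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the plugin's actual implementation.
public class IndexExcludeSketch {
    private final List<Pattern> patterns = new ArrayList<>();

    // Builds the exclusion list from the lines of a regex file.
    // Assumption: one regex per line; blank lines and "#" lines are skipped.
    public IndexExcludeSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed));
        }
    }

    // Returns true if any regex matches the whole URL.
    // matches() requires a full match, unlike find(), which accepts substrings.
    public boolean isExcluded(String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up file contents for demonstration.
        IndexExcludeSketch filter = new IndexExcludeSketch(Arrays.asList(
                "# listing pages: crawl for links, but do not index",
                "https?://example\\.com/category/.*"));

        System.out.println(filter.isExcluded("http://example.com/category/shoes"));
        System.out.println(filter.isExcluded("http://example.com/product/42"));
    }
}
```

With these made-up patterns, a category listing page would still be crawled 
for its outlinks but skipped at indexing time, while a product page would be 
indexed as usual.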

The file to use is specified by the following property in nutch-site.xml:

<property>
        <name>indexer.url.filter.exclude.regex.file</name>
        <value>regex-indexer-exclude-urls.txt</value>
        <description>
            Holds the name of the file containing the regex strings.  Any URL
            matching one of these strings in full will be excluded from indexing.
            Lines starting with "#" are comments and will be ignored.
        </description>
</property>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
