[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

Brian (JIRA) Wed, 17 Jul 2013 11:30:45 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711365#comment-13711365 ]


Brian edited comment on NUTCH-1614 at 7/17/13 6:29 PM:
-------------------------------------------------------

Can you please tell me how to do this?  I couldn't find anything about it.  
From what I can tell, URL filters apply to crawling, not just indexing, and I 
couldn't see how to apply them to indexing only.  I don't see how normalizing a 
URL would help in this case if it still filters the URL from the crawl and not 
just from indexing.

I see an option with the solrindex command, but it appears to be available 
only in Nutch 1.x.  Even if it were in 2.x, the documentation does not make 
clear how to use the option to achieve the desired effect:
http://wiki.apache.org/nutch/bin/nutch%20solrindex

                
      was (Author: brian44):
    Can you please tell me how to do this?  I couldn't find anything about how 
to do this.  From what I can tell URL filters apply to crawling not just 
indexing I couldn't see how to apply it to only indexing.  I don't see how 
normalizing a URL would help in this case if it still filters the URL from the 
crawl and not just indexing.
                  
> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1614
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1614
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 2.2.1
>            Reporter: Brian
>            Priority: Minor
>              Labels: plugin
>         Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a 
> main page) to get all the other pages, but we don't want to index those pages 
> themselves.  Therefore we cannot use the url filter approach.
> This plugin uses a file containing regex strings (see included sample file).  
> If one of the regex strings matches an entire URL, that URL will be 
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>         <name>indexer.url.filter.exclude.regex.file</name>
>         <value>regex-indexer-exclude-urls.txt</value>
>         <description>
>             Holds the file name containing the regex strings.  Any URL 
> matching one of these strings will be excluded from indexing. 
>             "#" indicates a comment line and will be ignored.
>         </description>
> </property>
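
The matching rule described in the issue (skip "#" comment lines, exclude a 
URL only when a regex matches the whole URL) might look roughly like the sketch 
below.  This is not the code in NUTCH-1614.patch and it does not use any Nutch 
API; the class name IndexExcludeSketch and its methods are hypothetical, and 
treating blank lines as ignorable is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the plugin from NUTCH-1614.patch.
class IndexExcludeSketch {
    private final List<Pattern> patterns = new ArrayList<>();

    // Builds the exclusion list from the lines of a rules file.
    IndexExcludeSketch(List<String> lines) {
        for (String line : lines) {
            String rule = line.trim();
            // Skip "#" comment lines; skipping blank lines is an assumption.
            if (rule.isEmpty() || rule.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(rule));
        }
    }

    // True when some rule matches the entire URL, not just a substring.
    boolean isExcluded(String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

A URL for which isExcluded returns true would still be fetched and parsed, so 
its outlinks are followed, but it would be left out of the index.  Because 
matches() tests the whole string, a rule needs a leading or trailing ".*" to 
behave like a substring match.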

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
