[ 
https://issues.apache.org/jira/browse/NUTCH-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian updated NUTCH-1830:
-------------------------
    Attachment: solr_delete_duplicates_add_exclusions_2.patch

Cleaned up the code a little: used Commons Lang's string join instead of 
reinventing the wheel.

> Solr Delete Duplicates: Adding option to exclude IDs matching specified 
> patterns
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1830
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1830
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Brian
>            Priority: Minor
>              Labels: Solr, dedupe, nutch
>         Attachments: regex-exclude-urls-from-dedup.txt, 
> solr_delete_duplicates_add_exclusions.patch, 
> solr_delete_duplicates_add_exclusions_2.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The SolrDeleteDuplicates class and its associated job have been helpful for 
> removing duplicate pages caused by variations of URLs.  However, there are 
> cases where pages are very similar in textual content, so the duplicate 
> detector would count them as duplicates, yet they still need to be kept as 
> distinct searchable pages.
> For example, for some products or resellers of our products the webpage 
> template is the same and only a small amount of text differs between 
> products/resellers.  Some of these pages are therefore counted as 
> duplicates, but we want all of them included and searchable on our site, so 
> people can find things by name even if the name is not in a key field.
> We can manually specify which group of URLs these pages correspond to (via 
> some regexes) to prevent them from being deleted as duplicates.  This 
> provides a mechanism for manually excluding documents, by ID, from 
> deduplication.
>         
> This patch adds a new property, "solr.exclude.from.dedup.regex.file", to 
> nutch-site.xml, allowing users to specify a file containing a list of 
> regular expressions:
> {code:xml}
> <property>
>    <name>solr.exclude.from.dedup.regex.file</name>
>    <value>regex-exclude-urls-from-dedup.txt</value>
>    <description>
>       Names the file containing regular expressions that specify URLs (ids) 
> to be excluded from the Solr deduplication process, i.e. any URL matching 
> one of the regular expressions will not be subject to deduplication.
>       Each pattern must be on its own line and start with a "-" character; 
> all other lines are ignored.  URLs must match the entire pattern.
>    </description>
> </property>
> {code}
>  
> The property names a file containing a list of regular expressions, one per 
> line, each introduced by a leading "-".  If a document ID matches one of 
> these expressions during deduplication, the document with that ID is 
> skipped, i.e. it is not subject to deduplication.
> Here is an example file:
> {code}
> #Allows specifying regular expressions; any URL matching one of them 
> #will not be subjected to potential deduplication.
> #Requires regex strings to match the full URL.
> #Each regex string must start with "-"; all other lines are ignored.
> #Exclude all reseller pages from deduplication:
> -.*/company/reseller/.*
> {code}
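The exclusion check described above can be sketched roughly as follows. This is a minimal, hypothetical illustration using only the JDK; the class and method names are mine and do not necessarily match the attached patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the exclusion-file handling described in the issue.
public class DedupExclusions {

    // Parse the exclusion-file contents: lines starting with "-" hold a
    // regex (everything after the "-"); comments and blank lines are ignored.
    public static List<Pattern> parse(String fileContents) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            line = line.trim();
            if (line.startsWith("-")) {
                patterns.add(Pattern.compile(line.substring(1)));
            }
        }
        return patterns;
    }

    // A document ID (URL) is excluded from dedup only if it matches one of
    // the patterns in full, per the property description.
    public static boolean isExcluded(String id, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(id).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Note that isExcluded uses Matcher.matches(), which requires the whole ID to match, mirroring the "URLs must match the entire pattern" rule from the property description.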



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
