[
https://issues.apache.org/jira/browse/NUTCH-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brian updated NUTCH-1830:
-------------------------
Attachment: regex-exclude-urls-from-dedup.txt
> Solr Delete Duplicates: Adding option to exclude IDs matching specified
> patterns
> --------------------------------------------------------------------------------
>
> Key: NUTCH-1830
> URL: https://issues.apache.org/jira/browse/NUTCH-1830
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Brian
> Priority: Minor
> Labels: Solr, dedupe, nutch
> Attachments: regex-exclude-urls-from-dedup.txt,
> solr_delete_duplicates_add_exclusions.patch
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> The SolrDeleteDuplicates class and its associated function have been helpful
> for getting rid of duplicate pages that arise from variations of URLs.
> However, in some cases pages whose textual content is very similar still need
> to be kept as distinct, searchable pages: the duplicate detector would count
> them as duplicates, but we want both of them to be searchable.
> For example, for some products or resellers of our products, the webpage
> template is the same and only a small amount of text differs between
> different products/resellers. Some of these pages are therefore counted as
> duplicates, but we want all of them to be included and searchable on our
> site, so people can find things by name, even if the name is not in a key
> field.
> We can manually specify which group of URLs these pages correspond to (via
> some regexes) to prevent them from being deleted as duplicates. This
> provides a mechanism for manually excluding documents from deduplication by
> ID.
>
> This patch adds an option to the configuration of nutch-site.xml, allowing
> users to specify a file containing a list of regular expressions with a new
> property "solr.exclude.from.dedup.regex.file":
> {code:xml}
> <property>
> <name>solr.exclude.from.dedup.regex.file</name>
> <value>regex-exclude-urls-from-dedup.txt</value>
> <description>
> Holds the file name of the file containing any regular expressions
> specifying URLs (ids) to be excluded from the Solr Deduplication process.
> I.e., any URL matching one of the regular expressions will not be
> subject to potential deduplication.
> Each pattern string must start on its own line with a "-" character at
> the beginning - all other lines will be ignored.
> Also, the URLs must match the entire pattern.
> </description>
> </property>
> {code}
>
> The property specifies the name of a file containing a list of regular
> expressions, each on its own line starting with "-".
> - If any ID matches one of these expressions during the deduplication
> process, the document with that ID will be skipped.
> -- I.e., it will not be subject to deduplication.
> Here is an example file:
> {code}
> #Allows specifying regular expressions for which any matching URLs
> #will not be subjected to potential deduplication
> #Requires regex strings to match full URL
> #Each regex string must start with "-"; all other lines are ignored.
> #Exclude all reseller pages from deduplication:
> -.*/company/reseller/.*
> {code}
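The rule format described above can be sketched in a few lines of Java. This is only an illustration of the file semantics (lines starting with "-" are patterns, everything else is ignored, and an ID must match a pattern in full); it is not the code from the attached patch. The class name `DedupExclusionRules` and the method `isExcluded` are hypothetical, and the example URLs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the exclusion-file format; not part of Nutch. */
public class DedupExclusionRules {
    private final List<Pattern> patterns = new ArrayList<>();

    /** Builds the rule set from the lines of an exclusion file. */
    public DedupExclusionRules(List<String> lines) {
        for (String line : lines) {
            // Only lines starting with "-" carry a pattern; comments,
            // blank lines, and anything else are ignored.
            if (line.startsWith("-") && line.length() > 1) {
                patterns.add(Pattern.compile(line.substring(1)));
            }
        }
    }

    /** Returns true if the ID (URL) matches any pattern in its entirety. */
    public boolean isExcluded(String id) {
        for (Pattern p : patterns) {
            // matches() requires the whole ID to match, unlike find().
            if (p.matcher(id).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "#Exclude all reseller pages from deduplication:",
            "-.*/company/reseller/.*");
        DedupExclusionRules rules = new DedupExclusionRules(lines);
        // Assumed example IDs; the first is under /company/reseller/, the second is not.
        System.out.println(rules.isExcluded("http://www.example.com/company/reseller/acme.html"));
        System.out.println(rules.isExcluded("http://www.example.com/company/about.html"));
    }
}
```

In a real job the lines would come from the file named by the property, for instance via `java.nio.file.Files.readAllLines`, and a document whose ID is excluded would simply be left out of the candidates for deletion.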