Brian created NUTCH-1830:
----------------------------
Summary: Solr Delete Duplicates: Adding option to exclude IDs
matching specified patterns
Key: NUTCH-1830
URL: https://issues.apache.org/jira/browse/NUTCH-1830
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Brian
Priority: Minor
The SolrDeleteDuplicates class and its associated function have been helpful
for getting rid of duplicate pages that arise from variations of URLs. However,
in some cases the textual content of two documents is so close that the
duplicate detector counts them as duplicates, even though both need to be kept
as distinct searchable pages.
For example, for some products or resellers of our products, the webpage
template is the same and only a small amount of text differs between
products/resellers. Some of these pages are therefore counted as duplicates,
but we want all of them to be included and searchable on our site, so people
can find things by name, even if the name is not in a key field.
We can manually specify which group of URLs these pages correspond to (via some
regexes) to prevent them from being deleted as duplicates. This provides a
mechanism for manually excluding documents from deduplication by ID.
This patch adds an option to the configuration of nutch-site.xml, allowing
users to specify a file containing a list of regular expressions with a new
property "solr.exclude.from.dedup.regex.file":
<property>
<name>solr.exclude.from.dedup.regex.file</name>
<value>regex-exclude-urls-from-dedup.txt</value>
<description>
Holds the file name of the file containing any regular expressions
specifying URLs (ids) to be excluded from the Solr Deduplication process.
I.e., any URL matching one of the regular expressions will not be subject
to potential deduplication.
Each pattern string must be on its own line, with a "-" character at
the beginning; all other lines will be ignored.
Also, each pattern must match the entire URL.
</description>
</property>
The property specifies the name of a file containing a list of regular
expressions, each on a line starting with "-".
- If any ID matches one of these expressions during the deduplication
process, the document with that ID will be skipped.
-- I.e., it will not be subject to deduplication.
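To illustrate the behaviour described above, here is a minimal sketch of how such a file could be read and applied. It is not the code from the patch: the class name DedupExclusionRules and the method isExcluded are hypothetical, and the sketch assumes that the text after the leading "-" is compiled as a java.util.regex pattern and that the whole ID must match it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not the class added by the patch. */
final class DedupExclusionRules {
  private final List<Pattern> patterns = new ArrayList<>();

  /** Reads the rules: only lines starting with "-" are patterns, all others are ignored. */
  DedupExclusionRules(Reader reader) throws IOException {
    try (BufferedReader in = new BufferedReader(reader)) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith("-") && line.length() > 1) {
          // Drop the leading "-" marker and compile the rest as a regex.
          patterns.add(Pattern.compile(line.substring(1)));
        }
      }
    }
  }

  /** Returns true if the entire ID matches at least one exclusion pattern. */
  boolean isExcluded(String id) {
    for (Pattern p : patterns) {
      if (p.matcher(id).matches()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    // The example URLs are made up for this sketch.
    String rules = "#comment lines are ignored\n-.*/company/reseller/.*\n";
    DedupExclusionRules r = new DedupExclusionRules(new StringReader(rules));
    System.out.println(r.isExcluded("http://example.com/company/reseller/acme"));
    System.out.println(r.isExcluded("http://example.com/products/widget"));
  }
}
```

In a deduplication loop, a document whose ID makes isExcluded return true would simply be left out of the set of deletion candidates.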
Here is an example file:
#Allows specifying regular expressions for which any matching URLs
#will not be subjected to potential deduplication
#Requires regex strings to match full URL
#Each regex string must start with "-"; all other lines are ignored.
#Exclude all reseller pages from deduplication:
-.*/company/reseller/.*
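
The leading and trailing ".*" in the example pattern matter because the pattern has to cover the full URL. The sketch below shows the difference, assuming full-match semantics equivalent to java.util.regex.Matcher.matches(); the class name FullMatchDemo and the sample URL are made up for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical demo of full-URL matching; not part of the patch. */
final class FullMatchDemo {
  // Pattern from the example file, without the leading "-" marker.
  static final Pattern RESELLER = Pattern.compile(".*/company/reseller/.*");
  // Same path without the surrounding ".*", to show why they are needed.
  static final Pattern PATH_ONLY = Pattern.compile("/company/reseller/");

  /** True only if the pattern matches the entire ID, not just a part of it. */
  static boolean fullMatch(Pattern p, String id) {
    return p.matcher(id).matches();
  }

  public static void main(String[] args) {
    String id = "http://www.example.com/company/reseller/acme.html";
    System.out.println(fullMatch(RESELLER, id));       // true: pattern spans the whole URL
    System.out.println(fullMatch(PATH_ONLY, id));      // false: pattern covers only a substring
    System.out.println(PATH_ONLY.matcher(id).find());  // true: substring search, not what is required
  }
}
```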