[ 
https://issues.apache.org/jira/browse/NUTCH-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian updated NUTCH-1830:
-------------------------

    Description: 
The SolrDeleteDuplicates class and its associated function have been helpful for 
removing duplicate pages that arise from variations of URLs.  However, there are 
some cases where pages are very similar in textual content but still need to be 
kept as distinct searchable pages.

Sometimes the textual content of two documents is so close that they are counted 
as duplicates by the duplicate detector, even though we want both of them to 
remain searchable.  

For example, for some products or resellers of our products, the webpage 
template is the same and only a small amount of text differs between 
products/resellers.  Some of these pages are therefore counted as duplicates, 
but we want all of them to be indexed and searchable on our site, so that 
people can find things by name even when the name is not in a key field.

We can manually specify which group of URLs these pages correspond to (via some 
regexes) to prevent them from being deleted as duplicates.

This patch therefore provides a mechanism for manually excluding documents, by 
ID, from deduplication.
        
This patch adds a new property, "solr.exclude.from.dedup.regex.file", to the 
nutch-site.xml configuration, allowing users to specify a file containing a 
list of regular expressions:

{code:xml}
<property>
   <name>solr.exclude.from.dedup.regex.file</name>
   <value>regex-exclude-urls-from-dedup.txt</value>
   <description>
      Holds the name of the file containing regular expressions specifying 
URLs (ids) to be excluded from the Solr deduplication process.
      I.e., any URL matching one of the regular expressions will not be subject 
to deduplication.
      Each pattern must start on its own line with a "-" character at the 
beginning; all other lines are ignored.
      Also, a URL must match a pattern in its entirety.
   </description>
</property>
{code}

 
The property names a file containing a list of regular expressions, each on a 
line starting with "-". If a document's ID matches any of these expressions 
during the deduplication process, that document is skipped, i.e., it is not 
subject to deduplication.
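The full-match skip check described above could be sketched in Java as follows 
(a minimal illustration only; the class and method names here are hypothetical 
and not taken from the patch):

{code:java}
import java.util.List;
import java.util.regex.Pattern;

public class DedupExclusionCheck {

    // Patterns compiled from the exclusion file named by
    // solr.exclude.from.dedup.regex.file.
    private final List<Pattern> exclusionPatterns;

    public DedupExclusionCheck(List<Pattern> exclusionPatterns) {
        this.exclusionPatterns = exclusionPatterns;
    }

    /**
     * Returns true if the document ID (URL) matches one of the
     * exclusion patterns in its entirety; such a document is
     * skipped by the deduplication pass.
     */
    public boolean isExcluded(String id) {
        for (Pattern p : exclusionPatterns) {
            // matches() requires the whole ID to match the pattern,
            // per the "URLs must match the entire pattern" rule.
            if (p.matcher(id).matches()) {
                return true;
            }
        }
        return false;
    }
}
{code}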

Here is an example file:
{code}
#Allows specifying regular expressions; any matching URLs
#will not be subjected to deduplication.
#Regex strings must match the full URL.
#Each regex string must start with "-"; all other lines are ignored.

#Exclude all reseller pages from deduplication:
-.*/company/reseller/.*
{code}
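Parsing this file format could look like the following (a sketch under the 
format rules stated above; the class name is hypothetical, not from the patch):

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ExclusionFileParser {

    /**
     * Reads the exclusion file: every line starting with "-" is
     * compiled as a regex (with the leading "-" stripped); all
     * other lines, including "#" comments and blank lines, are
     * ignored.
     */
    public static List<Pattern> parse(Reader in) throws IOException {
        List<Pattern> patterns = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("-")) {
                patterns.add(Pattern.compile(line.substring(1)));
            }
        }
        return patterns;
    }
}
{code}

For the example file above, this would yield a single compiled pattern, 
{{.*/company/reseller/.*}}, while skipping the comment lines.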

> Solr Delete Duplicates: Adding option to exclude IDs matching specified 
> patterns
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1830
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1830
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Brian
>            Priority: Minor
>              Labels: Solr, dedupe, nutch
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
