Ben Vachon created NUTCH-2540:
---------------------------------

             Summary: Support Generic Deduplication in Nutch 2.x
                 Key: NUTCH-2540
                 URL: https://issues.apache.org/jira/browse/NUTCH-2540
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 2.3.1
            Reporter: Ben Vachon
             Fix For: 2.4


Currently, deduplication in 2.x exists only as a utility for the Solr index.

My use case for Nutch required deduplication, so I wrote custom code that 
checks for duplicates based on digest and deletes them at index time. I figured 
I'd port the change so that others could use it as well.

This is a very simple approach to deduplication. There's plenty of room to 
improve it.

This change adds a new DataStore for Duplicate entries, each of which is just a 
list of URLs keyed by a signature.
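
To illustrate the shape of a Duplicate entry, here is a minimal in-memory 
sketch of a signature-to-URLs mapping. It is not the code from the patch: the 
class and method names are hypothetical, and a plain Map stands in for the 
actual DataStore.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Duplicate DataStore: signature -> URLs.
final class DuplicateIndexSketch {

    // Key: signature (for example a hex digest); value: URLs sharing it.
    private final Map<String, List<String>> urlsBySignature = new LinkedHashMap<>();

    /** Records that the page at {@code url} produced {@code signature}. */
    void add(String signature, String url) {
        urlsBySignature.computeIfAbsent(signature, k -> new ArrayList<>()).add(url);
    }

    /** Returns all URLs recorded for a signature, or an empty list. */
    List<String> urlsFor(String signature) {
        List<String> urls = urlsBySignature.get(signature);
        return urls == null ? Collections.<String>emptyList() : urls;
    }
}
```

Any signature that ends up with more than one URL identifies a group of 
duplicate pages.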

A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
WebPages into the Duplicate DataStore.

Since the key of the Duplicate store is the digest field of the WebPage store 
entries, duplicate matching can be configured via extension of the Signature 
abstract class.

A new "-deduplicate" argument is added to the IndexingJob (false by default). 
If this flag is used, the IndexingJob checks the Duplicate DataStore for 
duplicate URLs and runs pluggable DuplicateFilters to determine which URL 
belongs to the original WebPage. If the current WebPage is not the original, it 
is skipped. If it is the original, the other pages are deleted from the index.
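
The skip-or-delete decision described above can be sketched as follows. This 
is an illustration only, not the IndexingJob code from the patch: the class, 
the Decision type, and the method signature are hypothetical, and the original 
URL is passed in directly rather than chosen by DuplicateFilters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the per-page decision made at index time.
final class DedupDecisionSketch {

    /** What to do with one page at index time. */
    static final class Decision {
        final boolean indexPage;          // false means: skip this page
        final List<String> urlsToDelete;  // duplicates to delete from the index

        Decision(boolean indexPage, List<String> urlsToDelete) {
            this.indexPage = indexPage;
            this.urlsToDelete = urlsToDelete;
        }
    }

    /**
     * @param pageUrl       URL of the page currently being indexed
     * @param duplicateUrls all URLs sharing this page's signature
     * @param originalUrl   the URL selected as the original
     */
    static Decision decide(String pageUrl, List<String> duplicateUrls, String originalUrl) {
        if (!pageUrl.equals(originalUrl)) {
            // Not the original: skip it and leave the index alone.
            return new Decision(false, Collections.<String>emptyList());
        }
        // The original: index it and delete every other duplicate.
        List<String> toDelete = new ArrayList<>();
        for (String url : duplicateUrls) {
            if (!url.equals(pageUrl)) {
                toDelete.add(url);
            }
        }
        return new Decision(true, toDelete);
    }
}
```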

I've also added a BasicDuplicateFilter plugin class that considers the URL with 
the shortest path to be the original.
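
A rough sketch of the shortest-path rule, for illustration. It is not the 
BasicDuplicateFilter source: the class and method names are hypothetical, 
"shortest" is assumed to mean the fewest characters in the URL's path 
component, and the lexicographic tie-break is likewise an assumption.

```java
import java.net.URI;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pick the URL with the shortest path as the original.
final class ShortestPathSketch {

    /** Length of the path component; URLs without a path count as 0. */
    static int pathLength(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        String path = URI.create(url).getPath();
        return path == null ? 0 : path.length();
    }

    /** Returns the URL with the shortest path; ties go to the smaller string. */
    static String pickOriginal(List<String> urls) {
        Comparator<String> byPathLength =
                Comparator.comparingInt(ShortestPathSketch::pathLength);
        Comparator<String> order =
                byPathLength.thenComparing(Comparator.<String>naturalOrder());
        return urls.stream()
                .min(order)
                .orElseThrow(() -> new IllegalArgumentException("no URLs given"));
    }
}
```

A deterministic tie-break matters here, because each duplicate page is 
evaluated separately at index time and all of them need to agree on the same 
original.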

Eventually, it would be best to consider things like score and fetch time when 
determining which WebPage to keep and which to remove.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
