[
https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ben Vachon updated NUTCH-2540:
------------------------------
Environment: Non-distributed, single node, standalone Nutch jobs run in a
single JVM with HBase as the data store. 2.3.1
> Support Generic Deduplication in Nutch 2.x
> ------------------------------------------
>
> Key: NUTCH-2540
> URL: https://issues.apache.org/jira/browse/NUTCH-2540
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run
> in a single JVM with HBase as the data store. 2.3.1
> Reporter: Ben Vachon
> Priority: Major
> Labels: dedupe
> Fix For: 2.4
>
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use-case for Nutch required deduplication so I wrote custom code that
> checks for duplicates based on digest and deletes them at index time. I
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to deduplication. There's plenty of room to
> improve it.
> This change adds a new DataStore for Duplicate entries, which are just lists
> of URLs with signatures as keys.
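
To make the shape of a Duplicate entry concrete, here is a minimal in-memory sketch of "lists of URLs with signatures as keys". It is not the code from the patch: the class name DuplicateStoreSketch and its methods are hypothetical, and a plain Map stands in for the real DataStore.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Duplicate DataStore (assumption,
// not the class from the patch): signature -> list of URLs sharing it.
final class DuplicateStoreSketch {
    private final Map<String, List<String>> urlsBySignature = new LinkedHashMap<>();

    // Record that the page at `url` produced `signature` (its digest).
    void add(String signature, String url) {
        List<String> urls =
            urlsBySignature.computeIfAbsent(signature, k -> new ArrayList<>());
        if (!urls.contains(url)) {
            urls.add(url); // keep each URL once per signature
        }
    }

    // All URLs recorded for `signature`; empty if none were recorded.
    List<String> get(String signature) {
        return new ArrayList<>(
            urlsBySignature.getOrDefault(signature, new ArrayList<>()));
    }

    public static void main(String[] args) {
        DuplicateStoreSketch store = new DuplicateStoreSketch();
        store.add("digest-1", "http://example.com/a");
        store.add("digest-1", "http://example.com/a/copy");
        System.out.println(store.get("digest-1"));
    }
}
```

Any signature with more than one URL in its list marks a group of duplicate pages.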
> A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map
> WebPages into the Duplicate DataStore.
> Since the key of the Duplicate store is the digest field of the WebPage store
> entries, duplicate matching can be configured via extension of the Signature
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default).
> If this flag is used, the IndexingJob checks the Duplicate DataStore for
> duplicate URLs and runs pluggable DuplicateFilters to determine which URL
> belongs to the original WebPage. If the current WebPage is not the original,
> it is skipped; if it is the original, the other pages are deleted from the
> index.
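
The index-time decision described above can be sketched as follows. This is only an illustration under assumptions: DedupDecisionSketch, Action, Decision and decide are hypothetical names, and the original URL is passed in as if a DuplicateFilter had already chosen it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-page decision at index time (assumption,
// not the IndexingJob code from the patch).
final class DedupDecisionSketch {
    enum Action { INDEX, SKIP }

    static final class Decision {
        final Action action;
        final List<String> urlsToDelete; // URLs to remove from the index

        Decision(Action action, List<String> urlsToDelete) {
            this.action = action;
            this.urlsToDelete = urlsToDelete;
        }
    }

    // currentUrl: page being indexed; duplicateUrls: all URLs sharing its
    // signature; originalUrl: the URL chosen as the original by a filter.
    static Decision decide(String currentUrl, List<String> duplicateUrls,
                           String originalUrl) {
        if (duplicateUrls.size() <= 1) {
            return new Decision(Action.INDEX, new ArrayList<>()); // no duplicates
        }
        if (!currentUrl.equals(originalUrl)) {
            return new Decision(Action.SKIP, new ArrayList<>()); // not the original
        }
        List<String> others = new ArrayList<>(duplicateUrls);
        others.remove(currentUrl); // delete every duplicate except the original
        return new Decision(Action.INDEX, others);
    }
}
```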
> I've also added a BasicDuplicateFilter plugin class that considers the URL
> with the shortest path to be the original.
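
A rough sketch of the "shortest path wins" rule, for illustration only. It assumes "shortest" means the character length of the URL's path component and that ties go to the first URL seen; the actual BasicDuplicateFilter may measure this differently. ShortestPathSketch and its methods are hypothetical names.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a shortest-path rule (assumption, not the
// BasicDuplicateFilter from the patch).
final class ShortestPathSketch {
    // Length of the URL's path component; unparsable URLs sort last.
    static int pathLength(String url) {
        try {
            String path = URI.create(url).getPath();
            return path == null ? 0 : path.length();
        } catch (IllegalArgumentException e) {
            return Integer.MAX_VALUE;
        }
    }

    // Returns the URL with the shortest path, or null for an empty list.
    // On a tie, the URL that appears first in the list is kept.
    static String pickOriginal(List<String> urls) {
        String best = null;
        for (String url : urls) {
            if (best == null || pathLength(url) < pathLength(best)) {
                best = url;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(pickOriginal(Arrays.asList(
            "http://example.com/docs/2018/page.html",
            "http://example.com/page.html")));
    }
}
```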
> Eventually, it would be best to consider things like score and fetch time
> when determining which WebPage to keep and which to remove.