DeleteDuplicates based on crawlDB only 
---------------------------------------

                 Key: NUTCH-656
                 URL: https://issues.apache.org/jira/browse/NUTCH-656
             Project: Nutch
          Issue Type: Wish
          Components: indexer
            Reporter: julien nioche


The existing dedup functionality relies on Lucene indices and can't be used 
when the indexing is delegated to SOLR.
I was wondering whether we could instead use the information from the crawlDB 
to detect URLs to delete, and then perform the deletions in an indexer-neutral 
way. As far as I understand, the crawlDB contains all the elements we need for 
dedup, namely (a small dump sketch follows the list):
* URL 
* signature
* fetch time
* score
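
Just to show the four fields are indeed there, a tiny sketch that dumps them, 
assuming Nutch 1.x-style APIs (crawlDB as a SequenceFile of Text -> CrawlDatum); 
the part-file path and class name are made up for illustration:

{code:java}
// Minimal sketch, assuming Nutch 1.x APIs: each crawlDB entry is a
// Text (URL) -> CrawlDatum pair carrying signature, fetch time and score.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.StringUtil;

public class CrawlDbDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path part = new Path(args[0]); // e.g. crawldb/current/part-00000/data
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), part, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      if (datum.getSignature() == null) continue; // not fetched yet
      System.out.println(url + "\t"
          + StringUtil.toHexString(datum.getSignature()) + "\t"
          + datum.getFetchTime() + "\t" + datum.getScore());
    }
    reader.close();
  }
}
{code}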

In map-reduce terms we would have two different jobs: 
* read the crawlDB and compare entries by URL: keep only the most recent 
element; the older ones are stored in a file and will be deleted later

* read the crawlDB with a map function that emits the signature as key and 
URL + fetch time + score as value; the reduce function would depend on which 
parameter is set (i.e. signature or score) and would output a list of URLs to 
delete (a rough sketch of this second job follows the list)
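
To make the second job a bit more concrete, here is a minimal Hadoop sketch. 
To keep it self-contained it reads a tab-separated text dump (url, signature, 
fetchTime, score) like the one produced above rather than the crawlDB itself, 
and the class names are made up. The reduce keeps the highest-scoring entry, 
breaking ties by most recent fetch time, and emits every other URL as a 
deletion candidate; the first job would be structurally identical, with the 
URL as key and the fetch time as the criterion.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SignatureDedupJob {

  // Map: key each entry by its signature so duplicates meet in one reduce call.
  public static class SignatureMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      if (f.length != 4) return; // skip malformed lines
      // value carries url + fetch time + score, as described above
      context.write(new Text(f[1]), new Text(f[0] + "\t" + f[2] + "\t" + f[3]));
    }
  }

  // Reduce: keep the best entry per signature, output the rest for deletion.
  public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text signature, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String bestUrl = null;
      long bestTime = Long.MIN_VALUE;
      float bestScore = Float.NEGATIVE_INFINITY;
      List<String> urls = new ArrayList<String>();
      for (Text v : values) {
        String[] f = v.toString().split("\t");
        long time = Long.parseLong(f[1]);
        float score = Float.parseFloat(f[2]);
        urls.add(f[0]);
        if (score > bestScore || (score == bestScore && time > bestTime)) {
          bestUrl = f[0];
          bestScore = score;
          bestTime = time;
        }
      }
      for (String url : urls) {
        if (!url.equals(bestUrl)) {
          context.write(new Text(url), new Text("delete"));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "signature-dedup");
    job.setJarByClass(SignatureDedupJob.class);
    job.setMapperClass(SignatureMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}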

This assumes that we can then use the URLs to identify documents in the indices.
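
Assuming the URL is the uniqueKey of the SOLR schema, the indexer-side 
deletion could then be as simple as the following SolrJ sketch; the endpoint, 
input file format (one URL per line) and batch size are placeholders:

{code:java}
// Hypothetical deletion step: feed the URLs produced by the dedup job to
// SOLR. Assumes the document uniqueKey is the URL.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteDuplicates {
  public static void main(String[] args) throws Exception {
    SolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build();
    List<String> batch = new ArrayList<String>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String url;
    while ((url = in.readLine()) != null) {
      batch.add(url.trim());
      if (batch.size() >= 1000) { // send deletions in manageable batches
        solr.deleteById(batch);
        batch.clear();
      }
    }
    in.close();
    if (!batch.isEmpty()) solr.deleteById(batch);
    solr.commit(); // make the deletions visible
    solr.close();
  }
}
{code}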

Any thoughts on this? Am I missing something?

Julien



