[ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-684:
--------------------------------

    Attachment: solrdedup.patch

This is a first version of a Solr dedup feature. I haven't tested this patch 
much yet, so if you use it, it may blow up your computer.

I first thought about making duplicate deletion a generic class with Solr and 
Lucene backends. However, Lucene and Solr are so different in this regard that 
it was much easier to just write a new Solr dedup class.

Since URLs are assumed to be unique in Solr, SolrDeleteDuplicates only deletes 
documents that share a digest. Among documents with the same digest, the one 
with the highest score stays. If two of them also have the same score, the one 
with the later timestamp stays.
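
To make the selection rule concrete, here is a rough sketch of it in Python. 
It is not code from the patch: the Doc record, its field names (url, digest, 
score, tstamp) and the find_duplicates helper are all hypothetical, and it 
works on an in-memory list rather than on a Solr index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical record; field names are assumptions, not the patch's schema.
    url: str
    digest: str
    score: float
    tstamp: int


def find_duplicates(docs):
    """Return the sorted URLs to delete, keeping one document per digest."""
    keep = {}
    for doc in docs:
        best = keep.get(doc.digest)
        # Higher score wins; on equal scores the later timestamp wins.
        if best is None or (doc.score, doc.tstamp) > (best.score, best.tstamp):
            keep[doc.digest] = doc
    # Everything that is not the kept document for its digest is a duplicate.
    return sorted(d.url for d in docs if keep[d.digest] is not d)
```

Comparing the (score, tstamp) pairs as tuples expresses both rules at once: 
the score decides first, and the timestamp is only consulted on a tie.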

> Dedup support for Solr
> ----------------------
>
>                 Key: NUTCH-684
>                 URL: https://issues.apache.org/jira/browse/NUTCH-684
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: solrdedup.patch
>
>
> After NUTCH-442, Nutch can now index to both Solr and Lucene. However, the 
> duplicate deletion feature (based on digests) is only available for Lucene. 
> It should also be available for Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
