[ 
https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

julien nioche reopened NUTCH-656:
---------------------------------


I suppose that the SOLR dedup mechanism is only valid on a single instance. If the 
documents are distributed across a number of SOLR shards (by modifying 
NUTCH-442), there will be no way of detecting that two documents have the same 
signature when they are sent to different shards. Assuming that the documents are 
distributed across the SOLR shards based on their unique ID (i.e. their URL), the 
deduplication of documents based on URLs is already done. What the SOLR dedup 
could do is use the crawlDB, as described earlier, to find duplicates 
based on the signature and send deletion orders to the SOLR shards.

Not an urgent issue for the moment though, as NUTCH-442 supports only one SOLR 
backend.
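To illustrate the point above: if routing is done by hashing the URL, identical URLs can never coexist on two shards, but two different URLs carrying the same content signature may well land on different shards, so no single shard can detect them. The shard count and hashing scheme below are assumptions for the sketch, not NUTCH-442 behaviour:

```java
public class ShardRouting {
    // Hypothetical routing: a document is assigned to a shard by hashing
    // its unique ID (the URL). The same URL always maps to the same shard,
    // so URL-level dedup falls out of the routing itself.
    static int shardFor(String url, int numShards) {
        return Math.floorMod(url.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int shards = 4;
        // Two different URLs may share a content signature yet be routed to
        // different shards; signature dedup therefore has to happen outside
        // SOLR, e.g. from the crawlDB as proposed in this issue.
        String u1 = "http://example.com/page";
        String u2 = "http://mirror.org/page";
        System.out.println("u1 -> shard " + shardFor(u1, shards));
        System.out.println("u2 -> shard " + shardFor(u2, shards));
        // Same URL always routes identically:
        System.out.println(shardFor(u1, shards) == shardFor(u1, shards)); // true
    }
}
```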

> DeleteDuplicates based on crawlDB only 
> ---------------------------------------
>
>                 Key: NUTCH-656
>                 URL: https://issues.apache.org/jira/browse/NUTCH-656
>             Project: Nutch
>          Issue Type: Wish
>          Components: indexer
>            Reporter: julien nioche
>
> The existing dedup functionality relies on Lucene indices and can't be used 
> when the indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead 
> to detect the URLs to delete, then do the deletions in an indexer-neutral way. As 
> far as I understand, the crawlDB contains all the elements we need for dedup, 
> namely:
> * URL 
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs:
> * read the crawlDB and compare on URLs: keep only the most recent element; the 
> older ones are stored in a file and will be deleted later
> * read the crawlDB with a map function generating signatures as keys and URL 
> + fetch time + score as values
> * the reduce function would depend on which parameter is set (i.e. use signature 
> or score) and would output a list of URLs to delete
> This assumes that we can then use the URLs to identify documents in the 
> indices.
> Any thoughts on this? Am I missing something?
> Julien
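The reduce step of the second job described above could be sketched as follows, outside Hadoop, as plain Java. The record fields mirror the crawlDB elements listed in the issue (URL, signature, fetch time, score); the class and method names are illustrative only, not Nutch or Hadoop APIs:

```java
import java.util.*;

public class DedupSketch {
    // One crawlDB entry, reduced to the fields the issue says dedup needs.
    static class Doc {
        final String url, signature;
        final long fetchTime;
        final float score;
        Doc(String url, String signature, long fetchTime, float score) {
            this.url = url; this.signature = signature;
            this.fetchTime = fetchTime; this.score = score;
        }
    }

    /** Group docs by signature; per group keep the winner (by score or by
     *  fetch time, depending on which parameter is set) and collect the
     *  URLs of the losers, i.e. the deletion orders to send to the index. */
    static List<String> urlsToDelete(List<Doc> docs, boolean useScore) {
        Map<String, Doc> keep = new HashMap<>();
        List<String> delete = new ArrayList<>();
        for (Doc d : docs) {
            Doc best = keep.get(d.signature);
            if (best == null) { keep.put(d.signature, d); continue; }
            boolean dWins = useScore ? d.score > best.score
                                     : d.fetchTime > best.fetchTime;
            if (dWins) { keep.put(d.signature, d); delete.add(best.url); }
            else       { delete.add(d.url); }
        }
        return delete;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://a.example/1", "sigA", 100L, 0.9f),
            new Doc("http://a.example/2", "sigA", 200L, 0.4f),
            new Doc("http://b.example/1", "sigB", 150L, 0.7f));
        // With score as the criterion, the lower-scored duplicate of sigA goes:
        System.out.println(urlsToDelete(docs, true));   // [http://a.example/2]
        // With fetch time, the older duplicate goes instead:
        System.out.println(urlsToDelete(docs, false));  // [http://a.example/1]
    }
}
```

In the real job the grouping by signature would be done by the map-reduce framework itself (signature as map output key), and the resulting URL list would drive indexer-neutral delete requests.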

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
