[
https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781389#comment-13781389
]
Sebastian Nagel edited comment on NUTCH-656 at 10/1/13 7:27 AM:
----------------------------------------------------------------
Hi Julien, hi Markus,
regarding robustness: what happens in a continuous crawl if two duplicate
documents switch their order regarding score? Previously, A had a higher score
than B, so B was removed from the index. Now B gets the higher score, and
DeduplicationJob will remove A from the index as well. The current solr-dedup is
immune because in the second call only A is retrieved from Solr and there is no
need for deduplication.
For crawlDb-based deduplication, deduplicated docs/URLs must be flagged in
CrawlDb so that the index status is reflected there. Deduplication jobs can
then base their decisions on previous deduplications/deletions. Status changes
from "duplicate" to "not modified" could also be treated in a safe way by
forcing a re-index (and a re-fetch if required).
After duplicates are flagged in CrawlDb, the deletion of duplicates could then
be done by indexing jobs. Indexing backends (e.g. CSV) which cannot really
delete documents (deletion there simply means not to index) would also benefit.
In addition, re-fetch scheduling of duplicate docs could be altered to a lower
priority in a scoring filter.
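A rough sketch of what such a scoring filter could look like (the class name
and the "dedup.duplicate" metadata key are illustrative only, and a base class
with no-op defaults such as AbstractScoringFilter is assumed):
{code}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

// Illustrative sketch only: lower the generator sort value of documents that
// an earlier dedup job has flagged as duplicates, so they are re-fetched with
// lower priority. "dedup.duplicate" is an assumed metadata key.
public class DeduplicationScoringFilter extends AbstractScoringFilter {

  private static final Text DUPLICATE_FLAG = new Text("dedup.duplicate");

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (datum.getMetaData().containsKey(DUPLICATE_FLAG)) {
      return initSort * 0.1f; // demote known duplicates in the generator
    }
    return initSort;
  }
}
{code}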
Flagging is possible in the CrawlDatum's metadata. But a new db status
DUPLICATE, although a significant change, may be more explicit and efficient.
It would also be simpler to combine various dedup jobs, e.g. first by canonical
links (NUTCH-710), second by signature. It's clear that docs with status
DUPLICATE need no second (possibly contradicting) deduplication.
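A rough sketch of a reduce step flagging duplicates that way (all names are
illustrative; STATUS_DB_DUPLICATE would be the proposed new status, and the
mapper is assumed to have stored each URL in the datum's metadata under a
temporary key):
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative sketch: per signature, keep the entry with the highest score
// (most recent fetch time as tie-breaker) and flag all others with the
// proposed DUPLICATE status instead of deleting them right away.
public class MarkDuplicatesReducer
    extends Reducer<BytesWritable, CrawlDatum, Text, CrawlDatum> {

  // assumed temporary metadata key written by the mapper
  private static final Text URL_KEY = new Text("_dedup_url_");

  @Override
  protected void reduce(BytesWritable signature, Iterable<CrawlDatum> values,
      Context context) throws IOException, InterruptedException {
    CrawlDatum best = null;
    List<CrawlDatum> rest = new ArrayList<CrawlDatum>();
    for (CrawlDatum value : values) {
      CrawlDatum datum = new CrawlDatum();
      datum.set(value); // Hadoop reuses the value object, so copy it
      if (best == null || datum.getScore() > best.getScore()
          || (datum.getScore() == best.getScore()
              && datum.getFetchTime() > best.getFetchTime())) {
        if (best != null) rest.add(best);
        best = datum;
      } else {
        rest.add(datum);
      }
    }
    // write only the flagged duplicates; they would then be merged back into
    // CrawlDb and picked up (for deletion) by the indexing job
    for (CrawlDatum dup : rest) {
      dup.setStatus(CrawlDatum.STATUS_DB_DUPLICATE); // proposed new db status
      context.write((Text) dup.getMetaData().get(URL_KEY), dup);
    }
  }
}
{code}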
> DeleteDuplicates based on crawlDB only
> ---------------------------------------
>
> Key: NUTCH-656
> URL: https://issues.apache.org/jira/browse/NUTCH-656
> Project: Nutch
> Issue Type: Wish
> Components: indexer
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Attachments: NUTCH-656.patch
>
>
> The existing dedup functionality relies on Lucene indices and can't be used
> when the indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead
> to detect URLs to delete, then do the deletions in an indexer-neutral way. As
> far as I understand, the content of the crawlDB contains all the elements we
> need for dedup, namely:
> * URL
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs:
> * read crawlDB and compare on URLs: keep only the most recent element; older
> ones are stored in a file and will be deleted later
> * read crawlDB and have a map function generating signatures as keys and URL
> + fetch time + score as values
> * the reduce function would depend on which parameter is set (i.e. use
> signature or score) and would output a list of URLs to delete
> This assumes that we can then use the URLs to identify documents in the
> indices.
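> A minimal sketch of the signature-based job's map step (using Hadoop's newer
> mapreduce API; the class name is illustrative only):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.nutch.crawl.CrawlDatum;
>
> // Illustrative sketch: emit the page signature as key and URL + fetch time +
> // score as value, so the reducer can keep one entry per signature and list
> // the remaining URLs for deletion.
> public class SignatureMapper
>     extends Mapper<Text, CrawlDatum, BytesWritable, Text> {
>
>   @Override
>   protected void map(Text url, CrawlDatum datum, Context context)
>       throws IOException, InterruptedException {
>     byte[] signature = datum.getSignature();
>     if (signature == null) {
>       return; // nothing fetched/parsed yet, nothing to deduplicate
>     }
>     String value = url.toString() + "\t" + datum.getFetchTime() + "\t"
>         + datum.getScore();
>     context.write(new BytesWritable(signature), new Text(value));
>   }
> }
> {code}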
> Any thoughts on this? Am I missing something?
> Julien
--
This message was sent by Atlassian JIRA
(v6.1#6144)