[
https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781389#comment-13781389
]
Sebastian Nagel edited comment on NUTCH-656 at 10/1/13 7:27 AM:
----------------------------------------------------------------
Hi Julien, hi Markus,
regarding robustness: what happens in a continuous crawl if two duplicate
documents switch their order regarding score? Previously, A had a higher score
than B, so B was removed from the index. Now B gets the higher score, and
DeduplicationJob will remove A from the index as well. The current solr-dedup is
immune because in the second call only A is retrieved from Solr and there is no
need for deduplication.
For crawlDb-based deduplication, deduplicated docs/URLs must be flagged in
CrawlDb so that the index status is reflected there. Deduplication jobs can
then base their decisions on previous deduplications/deletions. Status changes
from "duplicate" to "not modified" could also be treated in a safe way by
forcing a re-index (and a re-fetch if required).
After duplicates are flagged in CrawlDb, the deletion of duplicates could then
be done by indexing jobs. Indexing backends (e.g. CSV) which cannot really
delete documents (deletion there simply means not to index) would also benefit.
In addition, re-fetch scheduling of duplicate docs could be altered to a lower
priority in a scoring filter.
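A rough sketch of what such a scoring filter could look like (the class name
and the "dedup.duplicate" metadata key are illustrative only, and a base class
with no-op defaults such as AbstractScoringFilter is assumed):
{code}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

// Illustrative sketch only: lower the generator sort value of documents that
// an earlier dedup job has flagged as duplicates, so they are re-fetched with
// lower priority. "dedup.duplicate" is an assumed metadata key.
public class DeduplicationScoringFilter extends AbstractScoringFilter {

  private static final Text DUPLICATE_FLAG = new Text("dedup.duplicate");

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (datum.getMetaData().containsKey(DUPLICATE_FLAG)) {
      return initSort * 0.1f; // demote known duplicates in the generator
    }
    return initSort;
  }
}
{code}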
Flagging is possible in the CrawlDatum's metadata. But a new db status
DUPLICATE, although a significant change, may be more explicit and efficient.
It would also be simpler to combine various dedup jobs, e.g. first by canonical
links (NUTCH-710), second by signature. It's clear that docs with status
DUPLICATE need no second (possibly contradicting) deduplication.
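A rough sketch of a reduce step flagging duplicates that way (all names are
illustrative; STATUS_DB_DUPLICATE would be the proposed new status, and the
mapper is assumed to have stored each URL in the datum's metadata under a
temporary key):
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative sketch: per signature, keep the entry with the highest score
// (most recent fetch time as tie-breaker) and flag all others with the
// proposed DUPLICATE status instead of deleting them right away.
public class MarkDuplicatesReducer
    extends Reducer<BytesWritable, CrawlDatum, Text, CrawlDatum> {

  // assumed temporary metadata key written by the mapper
  private static final Text URL_KEY = new Text("_dedup_url_");

  @Override
  protected void reduce(BytesWritable signature, Iterable<CrawlDatum> values,
      Context context) throws IOException, InterruptedException {
    CrawlDatum best = null;
    List<CrawlDatum> rest = new ArrayList<CrawlDatum>();
    for (CrawlDatum value : values) {
      CrawlDatum datum = new CrawlDatum();
      datum.set(value); // Hadoop reuses the value object, so copy it
      if (best == null || datum.getScore() > best.getScore()
          || (datum.getScore() == best.getScore()
              && datum.getFetchTime() > best.getFetchTime())) {
        if (best != null) rest.add(best);
        best = datum;
      } else {
        rest.add(datum);
      }
    }
    // write only the flagged duplicates; they would then be merged back into
    // CrawlDb and picked up (for deletion) by the indexing job
    for (CrawlDatum dup : rest) {
      dup.setStatus(CrawlDatum.STATUS_DB_DUPLICATE); // proposed new db status
      context.write((Text) dup.getMetaData().get(URL_KEY), dup);
    }
  }
}
{code}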
> DeleteDuplicates based on crawlDB only
> ---------------------------------------
>
> Key: NUTCH-656
> URL: https://issues.apache.org/jira/browse/NUTCH-656
> Project: Nutch
> Issue Type: Wish
> Components: indexer
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Attachments: NUTCH-656.patch
>
>
> The existing dedup functionality relies on Lucene indices and can't be used
> when the indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead
> to detect URLs to delete, then do the deletions in an indexer-neutral way. As
> far as I understand, the content of the crawlDB contains all the elements we
> need for dedup, namely:
> * URL
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs:
> * read crawlDB and compare on URLs: keep only the most recent element; older
> ones are stored in a file and will be deleted later
> * read crawlDB and have a map function generating signatures as keys and URL
> + fetch time + score as values
> * the reduce function would depend on which parameter is set (i.e. use
> signature or score) and would output a list of URLs to delete
> This assumes that we can then use the URLs to identify documents in the
> indices.
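> A minimal sketch of the signature-based job's map step (using Hadoop's newer
> mapreduce API; the class name is illustrative only):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.nutch.crawl.CrawlDatum;
>
> // Illustrative sketch: emit the page signature as key and URL + fetch time +
> // score as value, so the reducer can keep one entry per signature and list
> // the remaining URLs for deletion.
> public class SignatureMapper
>     extends Mapper<Text, CrawlDatum, BytesWritable, Text> {
>
>   @Override
>   protected void map(Text url, CrawlDatum datum, Context context)
>       throws IOException, InterruptedException {
>     byte[] signature = datum.getSignature();
>     if (signature == null) {
>       return; // nothing fetched/parsed yet, nothing to deduplicate
>     }
>     String value = url.toString() + "\t" + datum.getFetchTime() + "\t"
>         + datum.getScore();
>     context.write(new BytesWritable(signature), new Text(value));
>   }
> }
> {code}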
> Any thoughts on this? Am I missing something?
> Julien
--
This message was sent by Atlassian JIRA
(v6.1#6144)