[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108633#comment-13108633
 ] 

Julien Nioche commented on NUTCH-1052:
--------------------------------------

I like the original idea and agree that having to read/write the whole crawldb 
once more would be a pain for large crawls. This is a good example of what 2.0 
could add (or could have added if you are pessimistic). 

I agree with your suggested alternative to using null as the value: encode the 
action (add, delete) either as a complex object in the key or as part of the 
value. The latter makes more sense, since we are unlikely to add AND delete the 
same document in the same batch. Could you include that in your patch?
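
To make the "action as part of the value" option concrete, here is a minimal 
sketch in plain Java. It is not taken from any of the attached patches: the 
names IndexAction and ActionCollector are hypothetical, the document payload is 
reduced to a String, and in Nutch the value would presumably have to be a Hadoop 
Writable rather than a plain object.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type: the action travels with the value, so null no longer means "delete". */
final class IndexAction {
    enum Type { ADD, DELETE }

    final Type type;
    final String document; // stand-in for the real document payload; null for DELETE

    private IndexAction(Type type, String document) {
        this.type = type;
        this.document = document;
    }

    static IndexAction add(String document) {
        return new IndexAction(Type.ADD, document);
    }

    static IndexAction delete() {
        return new IndexAction(Type.DELETE, null);
    }
}

/** Hypothetical consumer that dispatches on the action carried by the value. */
final class ActionCollector {
    final List<String> added = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();

    void collect(String url, IndexAction value) {
        switch (value.type) {
            case ADD:
                added.add(url);   // a real indexer would send value.document here
                break;
            case DELETE:
                deleted.add(url); // a real indexer would issue a delete for this URL
                break;
        }
    }
}
```

Keeping the key as the bare URL means grouping and sorting by key stay 
unchanged; only the consumer has to inspect the action.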

> Multiple deletes of the same URL using SolrClean
> ------------------------------------------------
>
>                 Key: NUTCH-1052
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1052
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.3, 1.4
>            Reporter: Tim Pease
>            Assignee: Julien Nioche
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1052-1.4-1.patch, NUTCH-1052-1.4-2.patch, 
> NUTCH-1052-1.4-3.patch
>
>
> The SolrClean class does not keep track of purged URLs; it only checks the 
> URL status for "db_gone". When it is run multiple times, the same list of 
> URLs is deleted from Solr each time. For small, stable crawl databases this 
> is not a problem. For larger crawls this could be an issue, as SolrClean 
> becomes an expensive operation.
> One solution is to add a "purged" flag in the CrawlDatum metadata. SolrClean 
> would then check this flag in addition to the "db_gone" status before adding 
> the URL to the delete list.
> Another solution is to add a new state, "db_gone_and_purged", to the status 
> field.
> Either way, the crawl DB will need to be updated after the Solr delete has 
> successfully occurred.
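
The "purged" flag idea from the issue description above could be sketched 
roughly as follows. This is an illustration only, not the code in the attached 
patches: CrawlEntry is a simplified stand-in for CrawlDatum, and PurgeFilter and 
the "purged" metadata key are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a crawl DB entry; not the real CrawlDatum class. */
final class CrawlEntry {
    final String url;
    final String status; // e.g. "db_gone"
    final Map<String, String> metadata = new HashMap<>();

    CrawlEntry(String url, String status) {
        this.url = url;
        this.status = status;
    }
}

final class PurgeFilter {
    static final String PURGED_KEY = "purged"; // hypothetical metadata key

    /** Returns URLs that are gone but have not yet been purged from the index. */
    static List<String> selectForDelete(List<CrawlEntry> entries) {
        List<String> urls = new ArrayList<>();
        for (CrawlEntry e : entries) {
            boolean gone = "db_gone".equals(e.status);
            boolean purged = "true".equals(e.metadata.get(PURGED_KEY));
            if (gone && !purged) {
                urls.add(e.url);
            }
        }
        return urls;
    }

    /** Flags the given URLs as purged; call only after the index delete has succeeded. */
    static void markPurged(List<CrawlEntry> entries, List<String> deletedUrls) {
        for (CrawlEntry e : entries) {
            if (deletedUrls.contains(e.url)) {
                e.metadata.put(PURGED_KEY, "true");
            }
        }
    }
}
```

A second call to selectForDelete after markPurged skips the flagged entries, 
which is the repeated-delete saving the issue asks for, at the cost of the 
extra crawl DB update noted in the description.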

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
