[ 
https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102846#comment-13102846
 ] 

Tim Pease commented on NUTCH-1052:
----------------------------------

This patch looks like it will work. The delete method in SolrWriter should 
either increment the commitSize counter, or a separate counter should be 
added for deleted URLs, so that deletes are also committed in batches.
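
To make the first option concrete, here is a minimal sketch. It is not the 
actual SolrWriter code: the class BatchingWriter, its methods, and the 
in-memory "flush" are made up for illustration. The only idea taken from the 
patch discussion is that deletes count toward the same commitSize threshold 
as adds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index writer; NOT the Nutch SolrWriter API.
class BatchingWriter {
    private final int commitSize;                  // flush threshold
    private final List<String> pending = new ArrayList<>();
    private int commits = 0;                       // number of flushes so far

    BatchingWriter(int commitSize) {
        if (commitSize < 1) {
            throw new IllegalArgumentException("commitSize must be >= 1");
        }
        this.commitSize = commitSize;
    }

    // Adds and deletes share one pending list, so both count toward commitSize.
    void add(String url) {
        pending.add("add:" + url);
        flushIfFull();
    }

    void delete(String url) {
        pending.add("delete:" + url);
        flushIfFull();
    }

    // Flush whatever is left when the job finishes.
    void close() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    private void flushIfFull() {
        if (pending.size() >= commitSize) {
            flush();
        }
    }

    // Assumption: a real writer would send the batch to Solr and commit here.
    private void flush() {
        pending.clear();
        commits++;
    }

    int commits() { return commits; }
    int pendingCount() { return pending.size(); }
}
```

With a single shared list, a run that only deletes still flushes every 
commitSize operations instead of accumulating everything until close().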

Another thought: should something similar be done for URLs that have changed 
into redirects? For example, a webmaster might decide to change their URL 
slugs, so that all the old URLs become 301 redirects to the new locations. It 
would be nice to be able to purge those stale URLs from Solr as well.

Thanks for all the work on this issue! My hadoop skills are slowly increasing, 
and one day soon I'll be able to submit my own patches :)

> Multiple deletes of the same URL using SolrClean
> ------------------------------------------------
>
>                 Key: NUTCH-1052
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1052
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.3, 1.4
>            Reporter: Tim Pease
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1052-1.4-1.patch, NUTCH-1052-1.4-2.patch
>
>
> The SolrClean class does not keep track of purged URLs; it only checks the 
> URL status for "db_gone". When it is run multiple times, the same list of 
> URLs is deleted from Solr on every run. For small, stable crawl databases 
> this is not a problem. For larger crawls, SolrClean will become an 
> expensive operation.
> One solution is to add a "purged" flag in the CrawlDatum metadata. SolrClean 
> would then check this flag in addition to the "db_gone" status before adding 
> the URL to the delete list.
> Another solution is to add a new state, "db_gone_and_purged", to the 
> status field.
> Either way, the crawl DB will need to be updated after the Solr delete has 
> successfully occurred.
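
The "purged" flag approach described in the issue could be sketched roughly 
as follows. This is an illustration under assumptions, not Nutch code: 
CrawlEntry is a simplified stand-in for a crawl DB record (not CrawlDatum), 
and PurgeSelector and its methods are hypothetical names. The status string 
"db_gone" and the idea of a "purged" flag come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified crawl DB record; NOT Nutch's CrawlDatum.
class CrawlEntry {
    final String url;
    final String status;   // e.g. "db_gone" or "db_fetched"
    boolean purged;        // the proposed "purged" flag, false until deleted

    CrawlEntry(String url, String status) {
        this.url = url;
        this.status = status;
        this.purged = false;
    }
}

// Hypothetical helper showing the two steps from the issue description.
class PurgeSelector {
    // Step 1: pick only entries that are gone AND not yet purged.
    static List<CrawlEntry> selectForDelete(List<CrawlEntry> db) {
        List<CrawlEntry> toDelete = new ArrayList<>();
        for (CrawlEntry e : db) {
            if ("db_gone".equals(e.status) && !e.purged) {
                toDelete.add(e);
            }
        }
        return toDelete;
    }

    // Step 2: call this only after the Solr delete has succeeded,
    // so a failed delete is retried on the next run.
    static void markPurged(List<CrawlEntry> deleted) {
        for (CrawlEntry e : deleted) {
            e.purged = true;
        }
    }
}
```

A second run over the same entries then selects nothing, which is the 
behaviour the issue asks for.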

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
