[ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008410#comment-13008410 ]
Julien Nioche commented on NUTCH-963:
-------------------------------------
Re-dedup on the SOLR side: good point, although the SOLR dedup is based on
signature only, IIRC, and does not take the score of a doc into account.
The dedup/404 remover would allow doing one or both of these operations, so
people can deactivate what they don't need.
We're not likely to have the new deduplication any time soon anyway, so I am
definitely OK with adding the 404 remover in 1.3, provided, as you said, that it
has been tested and reviewed.
> Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404
> urls)
> ---------------------------------------------------------------------------------
>
> Key: NUTCH-963
> URL: https://issues.apache.org/jira/browse/NUTCH-963
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 2.0
> Reporter: Claudio Martella
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java,
> SolrClean.java
>
>
> When issuing recrawls it can happen that certain urls have expired (i.e. URLs
> that don't exist anymore and return 404).
> This patch creates a new command in the indexer that scans the crawldb
> looking for these urls and issues delete commands to SOLR.
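The scan-and-delete idea described in the issue can be sketched roughly as
follows. This is not the code from the attached patch: the class name, the
Status enum and the in-memory map standing in for the crawldb are all
hypothetical, and the sketch assumes the URL is used as the Solr document id.
It only builds the XML delete message that Solr's update handler accepts;
actually sending it to a Solr server is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GoneUrlDeleteSketch {

    // Hypothetical stand-in for the crawldb status of an entry; the real
    // Nutch code uses constants such as CrawlDatum.STATUS_DB_GONE.
    enum Status { DB_FETCHED, DB_GONE }

    // Collect the URLs whose crawldb status is "gone" (for example, 404).
    static List<String> selectGone(Map<String, Status> crawlDb) {
        List<String> gone = new ArrayList<>();
        for (Map.Entry<String, Status> e : crawlDb.entrySet()) {
            if (e.getValue() == Status.DB_GONE) {
                gone.add(e.getKey());
            }
        }
        return gone;
    }

    // Escape the characters that are special in XML text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build a Solr XML delete message (assumption: the URL is the document id).
    static String buildDeleteXml(List<String> urls) {
        StringBuilder sb = new StringBuilder("<delete>");
        for (String url : urls) {
            sb.append("<id>").append(escapeXml(url)).append("</id>");
        }
        return sb.append("</delete>").toString();
    }

    public static void main(String[] args) {
        // Tiny in-memory "crawldb" used only for illustration.
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.com/a", Status.DB_FETCHED);
        crawlDb.put("http://example.com/old?x=1&y=2", Status.DB_GONE);
        System.out.println(buildDeleteXml(selectGone(crawlDb)));
    }
}
```

In a real job the selection step would presumably run over the crawldb as a
MapReduce pass and the deletes would be batched before being sent, but those
details depend on the patch and are not shown here.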
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira