Chris M. Hostetter created SOLR-15294:
-----------------------------------------
Summary: Support "post-indexing" cleanup of documents with
duplicate signatures
Key: SOLR-15294
URL: https://issues.apache.org/jira/browse/SOLR-15294
Project: Solr
Issue Type: Sub-task
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Chris M. Hostetter
Since there is no way to (efficiently) have a document "overwrite" some
existing document with a different {{'id'}} but the same value in a
{{'signature'}} field, we should see if we can implement a solution to
"clean up" these kinds of pseudo-duplicates after a "batch" of indexing.
In the trivial case of adding one document, a Delete-By-Query (DBQ) for
{{(signatureField:sig -id:currentDoc)}} could be run right after adding
{{currentDoc}} ... but this doesn't scale well when adding many docs and
broadcasting these DBQs across many shards (an operation which requires a
distributed, collection-wide lock to ensure atomicity).
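
As a rough illustration of the single-document case, the sketch below builds the JSON body for such a DBQ. The helper names ({{quote_term}}, {{build_dedupe_dbq}}) are hypothetical, the field names are taken from the query above, and the values are quoted as phrases on the assumption that both fields are plain string fields. Sending the body to the collection's update handler is left out.

```python
import json


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_dedupe_dbq(signature_field: str, signature: str,
                     id_field: str, current_id: str) -> str:
    """Return a JSON delete-by-query body that matches every document sharing
    the signature, except the document that was just added."""
    query = (
        f"{signature_field}:{quote_term(signature)} "
        f"-{id_field}:{quote_term(current_id)}"
    )
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    # Hypothetical values mirroring the query in the description.
    print(build_dedupe_dbq("signatureField", "sig", "id", "currentDoc"))
```

One such request would be needed per added document, which is the cost that makes this approach impractical for bulk indexing.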
It would be nice if Solr offered some kind of _efficient_ functionality for
accomplishing the same eventual goal, in a way that could be run after a bulk
indexing job, or periodically under continuous indexing, such that "duplicate"
documents would _eventually_ be cleaned up.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)