Chris M. Hostetter created SOLR-15294:
-----------------------------------------

             Summary: Support "post-indexing" cleanup of documents with 
duplicate signatures
                 Key: SOLR-15294
                 URL: https://issues.apache.org/jira/browse/SOLR-15294
             Project: Solr
          Issue Type: Sub-task
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter



Since there is no way to (efficiently) have a document "overwrite" some 
existing document with a different {{'id'}} but the same value in a 
{{'signature'}} field, we should see if we can implement a solution to 
"clean up" these kinds of pseudo-duplicates after a "batch" of indexing.

In the trivial case of adding one document, a Delete-By-Query (DBQ) for 
{{(signatureField:sig -id:currentDoc)}} could be run right after adding 
{{currentDoc}} ... but this doesn't scale well when adding many, many docs 
and broadcasting these DBQs across many shards (an operation which requires 
a distributed, collection-wide lock to ensure atomicity).
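
For illustration, here is a minimal sketch of the per-document DBQ described above, built as a JSON delete command. The helper names ({{escape_term}}, {{dedupe_dbq}}, {{delete_payload}}) and the sample field name, signature and id are hypothetical; the sketch only builds the request body and does not send anything to Solr.

```python
import json


def escape_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def dedupe_dbq(signature_field: str, signature: str, current_id: str,
               id_field: str = "id") -> str:
    """Build a query matching every doc with this signature except current_id."""
    return (f"{signature_field}:{escape_term(signature)} "
            f"-{id_field}:{escape_term(current_id)}")


def delete_payload(query: str) -> str:
    """Serialize a delete-by-query command as a JSON update body."""
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    # Hypothetical values: one such DBQ would be needed per added document.
    print(delete_payload(dedupe_dbq("signature", "abc123", "doc-42")))
```

Issuing one such command per added document is exactly the pattern that scales poorly here.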

It would be nice if Solr offered some kind of _efficient_ functionality for 
accomplishing the same eventual goal, in a way that could be run after a bulk 
indexing job, or periodically under continuous indexing, such that "duplicate" 
documents would _eventually_ be cleaned up.
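
To make the desired end state concrete, here is a small in-memory sketch of which documents a cleanup pass would remove. It is not a proposal for how Solr should implement this, and it ignores sharding and efficiency entirely. The {{Doc}} type, its {{version}} field (used to decide which duplicate survives) and {{ids_to_delete}} are assumptions made for illustration only.

```python
from typing import Dict, List, NamedTuple


class Doc(NamedTuple):
    id: str
    signature: str
    version: int  # hypothetical ordering key: larger means indexed later


def ids_to_delete(docs: List[Doc]) -> List[str]:
    """Return ids of docs that share a signature with a more recent doc."""
    newest: Dict[str, Doc] = {}
    for doc in docs:
        kept = newest.get(doc.signature)
        # Keep only the most recently indexed doc for each signature.
        if kept is None or doc.version > kept.version:
            newest[doc.signature] = doc
    return sorted(d.id for d in docs if newest[d.signature].id != d.id)


if __name__ == "__main__":
    batch = [Doc("a", "s1", 1), Doc("b", "s1", 2), Doc("c", "s2", 1)]
    print(ids_to_delete(batch))
```

In this sketch the later document wins, mirroring the "overwrite" behavior described at the top of this issue.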




--
This message was sent by Atlassian Jira
(v8.3.4#803005)