[ 
https://issues.apache.org/jira/browse/SOLR-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308889#comment-17308889
 ] 

Chris M. Hostetter commented on SOLR-15294:
-------------------------------------------

One idea for an implementation of this would be a new "Stream Decorator" that 
could consume a (sorted) stream of all documents in the collection, and would 
emit only the documents that had the same value in a configured (signature) 
field as the document that preceded them – essentially the inverse of how the 
{{unique()}} stream decorator works – so that the resulting stream could be 
fed into the existing {{delete()}} decorator.

So given a collection of documents that might look like...
{noformat}
id,signature,importance
1, X,        100
2, Y,        5
3, Y,        50
4, X,        13
5, Z,        4
6, X,        50
{noformat}
You could use something like...
{code:java}
 delete(collection1,
        batchSize=500,
        not_unique(
          over="signature",
          search(collection1,
                 q="*:*",
                 qt="/export",
                 fl="id,signature,importance",
                 sort="signature asc, importance desc, id asc")))
{code}
...to delete documents 6, 4, and 2, because those are the documents that would be 
emitted by the hypothetical {{not_unique}} decorator based on the (sorted) 
output of the search...
{noformat}
1, X,        100
6, X,        50
4, X,        13
3, Y,        50
2, Y,        5
5, Z,        4
{noformat}
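The core emit-if-same-signature-as-previous rule such a decorator would apply can be sketched in plain Java. (This is only an illustration of the logic over pre-sorted tuples, not Solr's actual {{TupleStream}} API; the class and method names here are hypothetical.)
{code:java}
import java.util.*;

public class NotUniqueSketch {
    // Given tuples already sorted by the "over" field (here: signature),
    // collect the ids of every tuple whose signature matches the tuple
    // that preceded it -- i.e. everything except the first doc per signature.
    static List<String> notUnique(List<String[]> sorted, int overField) {
        List<String> emitted = new ArrayList<>();
        String prev = null;
        for (String[] tuple : sorted) {
            String sig = tuple[overField];
            if (sig.equals(prev)) {
                emitted.add(tuple[0]); // tuple[0] is the id; a duplicate to delete
            }
            prev = sig;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // id, signature, importance -- pre-sorted: signature asc, importance desc, id asc
        List<String[]> docs = Arrays.asList(
            new String[]{"1", "X", "100"},
            new String[]{"6", "X", "50"},
            new String[]{"4", "X", "13"},
            new String[]{"3", "Y", "50"},
            new String[]{"2", "Y", "5"},
            new String[]{"5", "Z", "4"});
        System.out.println(notUnique(docs, 1)); // [6, 4, 2]
    }
}
{code}
Because the stream is sorted by {{importance desc}} within each signature, the first (most important) document per signature survives and everything after it is emitted for deletion.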

> Support "post-indexing" cleanup of documents with duplicate signatures
> ----------------------------------------------------------------------
>
>                 Key: SOLR-15294
>                 URL: https://issues.apache.org/jira/browse/SOLR-15294
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> Since there is no way to (efficiently) have a document "overwrite" some 
> existing document with a different {{'id'}} but the same value in a 
> {{'signature'}} field, we should see if we can implement a solution to 
> "cleanup" these kinds of pseudo-duplicates after a "batch" of indexing.
> In the trivial case of adding one document, a Delete-By-Query for 
> {{(signatureField:sig -id:currentDoc)}} could be run right after adding 
> {{currentDoc}} ... but this doesn't scale well when adding many docs and 
> broadcasting these DBQs across many shards (an operation which requires a 
> distributed, collection-wide lock to ensure atomicity).
> It would be nice if Solr offered some kind of _efficient_ functionality for 
> accomplishing the same eventual goal, in a way that could be run after a bulk 
> indexing job, or periodically under continuous indexing, such that 
> "duplicate" documents would _eventually_ be cleaned up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
