[ 
https://issues.apache.org/jira/browse/SOLR-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804336#comment-15804336
 ] 

Tim Owen commented on SOLR-9918:
--------------------------------

OK, I see what you mean. I can explain our use-case, if that helps to 
understand why we developed this processor and when it might prove useful.

We have a Kafka queue of messages, which are a mixture of Create, Update and 
Delete operations, and these are consumed and fed into two different storage 
systems: Solr and an RDBMS. We want the behaviour to be consistent, so that the 
two systems stay in sync. The Database storage app implements Create 
operations as effectively {{INSERT IF NOT EXISTS ...}}, and Update operations 
as the typical SQL {{UPDATE .. WHERE id = ..}}, which quietly does nothing if 
there is no row for {{id}}. So we want the Solr storage to behave in the same 
way.
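
As a rough illustration of those semantics (not the processor's actual code), 
here is a minimal in-memory sketch. The {{apply_message}} function and the 
dict-based {{store}} are hypothetical names standing in for either storage 
backend.

```python
# In-memory sketch only: `store` is a plain dict standing in for Solr or the
# database, and `apply_message` is a hypothetical helper, not a Solr API.
def apply_message(store, op, doc_id, fields):
    """Apply one queue message; return True if the store was modified."""
    if op == "create":
        if doc_id in store:
            return False  # like INSERT IF NOT EXISTS: duplicate create is skipped
        store[doc_id] = dict(fields)
        return True
    if op == "update":
        if doc_id not in store:
            return False  # like UPDATE .. WHERE id = ..: no match, quiet no-op
        store[doc_id].update(fields)
        return True
    if op == "delete":
        return store.pop(doc_id, None) is not None
    raise ValueError(f"unknown operation: {op}")


store = {}
results = [
    apply_message(store, "create", "a", {"title": "first"}),
    apply_message(store, "create", "a", {"title": "duplicate"}),
    apply_message(store, "update", "a", {"views": 1}),
    apply_message(store, "update", "missing", {"views": 1}),
]
```

The duplicate Create and the Update for the missing {{id}} both leave the 
store untouched and raise nothing.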

There can occasionally be duplicate messages that Create the same {{id}} due to 
the hundreds of instances of the app that adds messages to Kafka, and small 
race conditions that mean two or more of them will do some duplicate work. We 
chose to accept this situation and de-dupe downstream by having both storage 
apps behave as above.

Another scenario is that, since we have the Kafka queue as a buffer, if there 
are any problems downstream we can always stop the storage apps, restore last 
night's backup, rewind the Kafka consumer offset (to slightly before the backup 
point) and then replay. In this situation we don't want a lot of index churn 
for the overlapping Create messages.
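
A small sketch of why skipping existing documents keeps such a replay cheap. 
This is plain Python with made-up data; {{replay}}, the message list and the 
offsets are all hypothetical, and no Kafka or Solr API is involved.

```python
# Hypothetical model of replaying Create messages over a restored backup.
def replay(index, messages, from_offset):
    """Re-apply Create messages from `from_offset`; return the number of writes."""
    writes = 0
    for doc_id, fields in messages[from_offset:]:
        if doc_id in index:
            continue  # already in the restored backup: skip, no index churn
        index[doc_id] = dict(fields)
        writes += 1
    return writes


messages = [
    ("d1", {"title": "one"}),
    ("d2", {"title": "two"}),
    ("d3", {"title": "three"}),
    ("d4", {"title": "four"}),
]

# The backup already contains the first three documents.
restored_index = {doc_id: dict(fields) for doc_id, fields in messages[:3]}

# Rewind to an offset before the backup point, so d2 and d3 overlap.
writes = replay(restored_index, messages, from_offset=1)
```

Only the document created after the backup is written; the overlapping 
Creates are skipped.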

With updates, the apps which add Update messages only have best-effort 
knowledge of which document/row {{id}}s are relevant to the field/column being 
changed by the update message. So we quite commonly have messages that are 
optimistic updates, for a document that doesn't in fact exist (now). The 
database storage handles this quietly, so we wanted the same behaviour in Solr. 
Initially what happened in Solr was we'd get newly-created documents containing 
only the fields changed in the AtomicUpdate, so we added a required field to 
avoid that happening, which works but is noisy as we get a Solr exception each 
time (and then batch updates are messy because we have to split and retry).
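
To make the difference concrete, here is a simplified model of a batch of 
updates with and without skipping. It is plain Python, not Solr code: 
{{apply_updates}}, {{REQUIRED_FIELDS}} and the {{skip_update_if_missing}} 
argument are hypothetical names, and aborting on the first failure is an 
assumption of the model.

```python
# Simplified model only; real Solr batch and error handling is more involved.
REQUIRED_FIELDS = {"title"}  # hypothetical required field


def apply_updates(index, updates, skip_update_if_missing):
    """Apply (id, changed_fields) updates; return the ids that were skipped."""
    skipped = []
    for doc_id, changes in updates:
        if doc_id in index:
            index[doc_id].update(changes)
        elif skip_update_if_missing:
            skipped.append(doc_id)  # quiet no-op, like an UPDATE matching no row
        else:
            # The update would become a new, partial document. The
            # required-field check rejects it and, in this model, the
            # exception aborts the rest of the batch.
            missing = REQUIRED_FIELDS - set(changes)
            if missing:
                raise ValueError(f"{doc_id}: missing required fields {sorted(missing)}")
            index[doc_id] = dict(changes)
    return skipped


index = {"a": {"title": "first"}}
batch = [("ghost", {"views": 1}), ("a", {"views": 2})]

try:
    apply_updates(index, batch, skip_update_if_missing=False)
    failed = False
except ValueError:
    failed = True  # the caller would now have to split the batch and retry

skipped = apply_updates(index, batch, skip_update_if_missing=True)
```

With skipping enabled, the update for the missing document is dropped and the 
rest of the batch is applied in one pass.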

I looked at {{DocBasedVersionConstraintsProcessor}} but we don't have 
explicitly-managed versioning for our documents in Solr. Then I looked at 
{{SignatureUpdateProcessor}} but that does churn the index and overwrites 
documents, which we didn't want. I also considered {{TolerantUpdateProcessor}} 
but that doesn't really solve the issue for inserts; it would just make some 
update batches less noisy.

I'd say this processor is useful in situations where you have documents that 
don't have any concept of multiple versions that can be assigned by the app, 
and don't have any kind of fuzziness about similar documents, i.e. each 
document has a strong identity, akin to a Database unique key.

> An UpdateRequestProcessor to skip duplicate inserts and ignore updates to 
> missing docs
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-9918
>                 URL: https://issues.apache.org/jira/browse/SOLR-9918
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update
>            Reporter: Tim Owen
>         Attachments: SOLR-9918.patch, SOLR-9918.patch
>
>
> This is an UpdateRequestProcessor and Factory that we have been using in 
> production, to handle 2 common cases that were awkward to achieve using the 
> existing update pipeline and current processor classes:
> * When inserting document(s), if some already exist then quietly skip the new 
> document inserts - do not churn the index by replacing the existing documents 
> and do not throw a noisy exception that breaks the batch of inserts. By 
> analogy with SQL, {{insert if not exists}}. In our use-case, multiple 
> application instances can (rarely) process the same input so it's easier for 
> us to de-dupe these at Solr insert time than to funnel them into a global 
> ordered queue first.
> * When applying AtomicUpdate documents, if a document being updated does not 
> exist, quietly do nothing - do not create a new partially-populated document 
> and do not throw a noisy exception about missing required fields. By analogy 
> with SQL, {{update where id = ..}}. Our use-case relies on this because we 
> apply updates optimistically and have best-effort knowledge about what 
> documents will exist, so it's easiest to skip the updates (in the same way a 
> Database would).
> I would have kept this in our own package hierarchy but it relies on some 
> package-scoped methods, and seems like it could be useful to others if they 
> choose to configure it. Some bits of the code were borrowed from 
> {{DocBasedVersionConstraintsProcessorFactory}}.
> Attached patch has unit tests to confirm the behaviour.
> This class can be used by configuring solrconfig.xml like so..
> {noformat}
>   <updateRequestProcessorChain name="skipexisting">
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
>       <bool name="skipInsertIfExists">true</bool>
>       <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
>     </processor>
>     <processor class="solr.DistributedUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> {noformat}
> and initParams defaults of
> {noformat}
>       <str name="update.chain">skipexisting</str>
> {noformat}


