[
https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308892#comment-17308892
]
Chris M. Hostetter commented on SOLR-3473:
------------------------------------------
9 years later...
The current state of things, long after the above-mentioned related Jiras have
been addressed, is that SignatureUpdateProcessorFactory *CAN* be safely used in
SolrCloud for two possible use cases:
* For de-duplication:
** the signatureField _MUST_ be the uniqueKey field *AND* the processor _MUST_
be configured to run prior to DistributedUpdateProcessor
* Solely for generating signatures, w/o de-duplication
** overwriteDupes _MUST_ be set to false ... any signatureField may be used,
and it may run at any point in the processor chain
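As a rough illustration of the two use cases above, a {{solrconfig.xml}} fragment might look like the sketch below. The chain names, the {{fields}} list, the {{signature_s}} field and the choice of {{Lookup3Signature}} are assumptions made for the example, not something taken from this issue, and the uniqueKey field is assumed to be named {{id}}. The first chain leaves {{overwriteDupes}} unset because the comment above does not say what value that case needs.

```xml
<!-- Use case 1 (de-duplication): signatureField is the uniqueKey field
     (assumed to be "id") and the signature processor is listed before
     the distributed update processor. -->
<updateRequestProcessorChain name="dedupe-on-id">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <!-- hypothetical source fields used to compute the signature -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Use case 2 (signature generation only): overwriteDupes is false,
     so any signatureField may be used. "signature_s" is a hypothetical
     field that would need to exist in the schema. -->
<updateRequestProcessorChain name="signature-only">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_s</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

In the first chain, listing {{DistributedUpdateProcessorFactory}} explicitly is one way to make the ordering visible; in the second, the position of the signature processor should not matter, per the description above.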
If you attempt to use SignatureUpdateProcessorFactory for de-duplication w/ a
non-uniqueKey signature field, one of two failure situations is likely to
arise:
* in a multi-shard collection, documents with identical signatureField values
will not be removed from any shard (leader) other than the one the document is
routed to (by its id)
* even in a single-shard collection, with multiple replicas, documents with
identical signatureField values will *only* be deleted on the 'leader' and not
on any other replicas, because the leader does not propagate the
{{AddUpdateCommand.updateTerm}} computed by the SignatureUpdateProcessorFactory
to each of its replicas
----
In general, I don't think it's a good idea to try and "fix" the way
SignatureUpdateProcessorFactory implements "De-Duplication" over fields that
aren't the unique key field in SolrCloud. The original implementation of
{{overwriteDupes=true}} leveraged low-level Lucene functionality designed for
"replacing" documents by their uniqueKey, but swapped in the "signatureField"
instead of the "id" field and then also did a (local) Delete-By-Query looking
for any docs with the same "id" field value.
This is already less efficient (locally) than a simple "replace" of a document
based on its uniqueKey – but if we also broadcast that DBQ to every shard in a
SolrCloud use case, it would be prohibitively slow, particularly since DBQs in
Solr require a distributed "lock" preventing concurrent indexing, to ensure the
delete is done atomically and shards are kept consistent (something that I
didn't realize 9 years ago).
I've opened SOLR-15290 to track some new ideas for approaching this problem –
starting with better docs, and better warnings/errors to prevent people from
getting into problematic situations – and maybe some ideas for addressing
"cloud level deduplication" in a scalable way.
> Distributed deduplication broken
> --------------------------------
>
> Key: SOLR-3473
> URL: https://issues.apache.org/jira/browse/SOLR-3473
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud, update
> Affects Versions: 4.0-ALPHA
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 4.9, 6.0
>
> Attachments: SOLR-3473-trunk-2.patch, SOLR-3473.patch, SOLR-3473.patch
>
>
> Solr's deduplication via the SignatureUpdateProcessor is broken for
> distributed updates on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this
> won't currently work with distrib updates. Could you file a JIRA issue for
> that? The problem is that we convert update commands into solr documents -
> and that can cause a loss of info if an update proc modifies the update
> command.
> I think the reason that you see a multiple values error when you try the
> other order is because of the lack of a document clone (the other issue I
> mentioned a few emails back). Addressing that won't solve your issue though -
> we have to come up with a way to propagate the currently lost info on the
> update command.
> {quote}
> Please see the ML thread for the full discussion:
> http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html