[ 
https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-3473:
-------------------------------------
    Description: 
The current state of things (as of 8.8) is that SignatureUpdateProcessorFactory 
*CAN* be safely used in SolrCloud for two possible use cases:
 * For de-duplication:
 ** the signatureField _MUST_ be the uniqueKey field *AND* the processor _MUST_ 
be configured to run prior to DistributedUpdateProcessor
 * Solely for generating signatures, w/o de-duplication
 ** overwriteDupes _MUST_ be set to false ... any signatureField may be used, 
and it may run at any point in the processor chain
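
As a rough illustration of the first use case (a sketch only, not a verified 
config: the chain name and the {{fields}} list are placeholders, and the 
uniqueKey field is assumed to be {{id}}), a chain along these lines writes the 
signature into the uniqueKey field and runs the signature processor ahead of 
DistributedUpdateProcessor:

{code:xml}
<!-- Sketch only. "dedupe" and the fields list are placeholder values. -->
<updateRequestProcessorChain name="dedupe">
  <!-- Must run before distributed routing so the signature becomes the id. -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Assumes the schema's uniqueKey field is named "id". -->
    <str name="signatureField">id</str>
    <!-- Duplicates are replaced by ordinary uniqueKey overwrite. -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{code}

For the second use case, the same processor could point {{signatureField}} at 
any other field, keep {{overwriteDupes}} set to false, and sit anywhere in the 
chain.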

If you attempt to use SignatureUpdateProcessorFactory for de-duplication w/ a 
non-uniqueKey signature field, one of two failure situations is likely to 
arise:
 * in a multi-shard collection, documents with identical signatureField values 
will not be removed from any shard (leader) other than the one the document is 
routed to (by its id)
 * even in a single-shard collection, with multiple replicas, documents with 
identical signatureField values will *only* be deleted on the 'leader' and not 
on any other replicas, because the leader does not propagate the 
{{AddUpdateCommand.updateTerm}} computed by the SignatureUpdateProcessorFactory 
to each of its replicas

{panel:title=original bug report}

Solr's deduplication via the SignatureUpdateProcessor is broken for distributed 
updates on SolrCloud.

Mark Miller:
{quote}
Looking again at the SignatureUpdateProcessor code, I think that indeed this 
won't currently work with distrib updates. Could you file a JIRA issue for 
that? The problem is that we convert update commands into solr documents - and 
that can cause a loss of info if an update proc modifies the update command.

I think the reason that you see a multiple values error when you try the other 
order is because of the lack of a document clone (the other issue I mentioned a 
few emails back). Addressing that won't solve your issue though - we have to 
come up with a way to propagate the currently lost info on the update command.
{quote}

Please see the ML thread for the full discussion: 
http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
{panel}

  was:
Solr's deduplication via the SignatureUpdateProcessor is broken for distributed 
updates on SolrCloud.

Mark Miller:
{quote}
Looking again at the SignatureUpdateProcessor code, I think that indeed this 
won't currently work with distrib updates. Could you file a JIRA issue for 
that? The problem is that we convert update commands into solr documents - and 
that can cause a loss of info if an update proc modifies the update command.

I think the reason that you see a multiple values error when you try the other 
order is because of the lack of a document clone (the other issue I mentioned a 
few emails back). Addressing that won't solve your issue though - we have to 
come up with a way to propagate the currently lost info on the update command.
{quote}

Please see the ML thread for the full discussion: 
http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

        Summary: Distributed deduplication broken when using non-uniqueKey for 
signatureField  (was: Distributed deduplication broken)

> Distributed deduplication broken when using non-uniqueKey for signatureField
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-3473
>                 URL: https://issues.apache.org/jira/browse/SOLR-3473
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, update
>    Affects Versions: 4.0-ALPHA
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 4.9, 6.0
>
>         Attachments: SOLR-3473-trunk-2.patch, SOLR-3473.patch, SOLR-3473.patch
>
>
> The current state of things (as of 8.8) is that 
> SignatureUpdateProcessorFactory *CAN* be safely used in SolrCloud for two 
> possible use cases:
>  * For de-duplication:
>  ** the signatureField _MUST_ be the uniqueKey field *AND* the processor 
> _MUST_ be configured to run prior to DistributedUpdateProcessor
>  * Solely for generating signatures, w/o de-duplication
>  ** overwriteDupes _MUST_ be set to false ... any signatureField may be used, 
> and it may run at any point in the processor chain
> If you attempt to use SignatureUpdateProcessorFactory for de-duplication w/ a 
> non-uniqueKey signature field, one of two failure situations is likely to 
> arise:
>  * in a multi-shard collection, documents with identical signatureField 
> values will not be removed from any shard (leader) other than the one the 
> document is routed to (by its id)
>  * even in a single-shard collection, with multiple replicas, documents with 
> identical signatureField values will *only* be deleted on the 'leader' and 
> not on any other replicas, because the leader does not propagate the 
> {{AddUpdateCommand.updateTerm}} computed by the 
> SignatureUpdateProcessorFactory to each of its replicas
> {panel:title=original bug report}
> Solr's deduplication via the SignatureUpdateProcessor is broken for 
> distributed updates on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this 
> won't currently work with distrib updates. Could you file a JIRA issue for 
> that? The problem is that we convert update commands into solr documents - 
> and that can cause a loss of info if an update proc modifies the update 
> command.
> I think the reason that you see a multiple values error when you try the 
> other order is because of the lack of a document clone (the other issue I 
> mentioned a few emails back). Addressing that won't solve your issue though - 
> we have to come up with a way to propagate the currently lost info on the 
> update command.
> {quote}
> Please see the ML thread for the full discussion: 
> http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
> {panel}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
