[ 
https://issues.apache.org/jira/browse/SOLR-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143846#comment-14143846
 ] 

Timothy Potter commented on SOLR-6249:
--------------------------------------

This mechanism is mainly a convenience for the client, so they don't have to 
poll the zk version from all replicas themselves. If a timeout occurs, then 
either A) one or more of the replicas couldn't process the update, or B) one 
or more of the replicas was just being really slow. If A, then the client app 
can't really proceed safely without resolving the root cause. I'm not sure 
what to do about this without going down the path of having a distributed 
transaction that allows us to roll back updates if any replica fails.

If B, the client can wait longer, but then they would have to poll all the 
replicas themselves, which makes clients implement this same polling solution 
on their side as well, so it's not very convenient.

One thing we could do to make B (and possibly even A) more convenient to deal 
with is to use the solution I proposed for SOLR-6550 to pass back the URLs of 
the replicas that timed out using the extended exception metadata. That at 
least narrows the scope for the client, but it's still inconvenient.

Alternatively, async would work, but at some point doesn't the client have to 
give up polling? Hence we're back to effectively having a timeout. I took this 
ticket to mean that a client doesn't want to proceed with more updates until it 
knows all cores have seen the current update, so async seems to just move the 
problem out to the client.

I'm happy to implement the async approach, but from where I sit now, I think 
we should build distributed 2-phase commit transaction support into managed 
schema, as it will be useful going forward for managed config. That way, 
clients can make a change and then be certain it was either applied entirely 
or not at all, and that their cluster remains in a consistent state. This 
would of course only apply to schema and config changes, so I'm not talking 
about distributed transactions for Solr in general.
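For the sake of discussion, the all-or-nothing behavior I have in mind looks 
roughly like the toy sketch below. The `Replica` class and its 
prepare/commit/rollback methods are purely hypothetical stand-ins for cores, 
not anything that exists in Solr today; the point is only the shape of the 
two phases.

```python
class Replica:
    """Toy in-memory stand-in for a core holding a managed schema."""
    def __init__(self, name, can_apply=True):
        self.name = name
        self.can_apply = can_apply  # simulate a core that would reject the change
        self.schema = {}
        self._staged = None

    def prepare(self, change):
        # Phase 1: validate and stage the change without making it visible.
        if not self.can_apply:
            return False
        self._staged = dict(change)
        return True

    def commit(self):
        # Phase 2: make the staged change visible.
        self.schema.update(self._staged)
        self._staged = None

    def rollback(self):
        # Discard the staged change, leaving the schema untouched.
        self._staged = None

def apply_schema_change(replicas, change):
    """All-or-nothing: commit only if every replica prepares successfully."""
    prepared = []
    for r in replicas:
        if r.prepare(change):
            prepared.append(r)
        else:
            for p in prepared:
                p.rollback()
            return False
    for r in prepared:
        r.commit()
    return True
```

If any replica fails the prepare phase, nothing is applied anywhere, which is 
exactly the consistency guarantee described above.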

> Schema API changes return success before all cores are updated
> --------------------------------------------------------------
>
>                 Key: SOLR-6249
>                 URL: https://issues.apache.org/jira/browse/SOLR-6249
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis, SolrCloud
>            Reporter: Gregory Chanan
>            Assignee: Timothy Potter
>         Attachments: SOLR-6249.patch, SOLR-6249.patch
>
>
> See SOLR-6137 for more details.
> The basic issue is that Schema API changes return success when the first core 
> is updated, but other cores asynchronously read the updated schema from 
> ZooKeeper.
> So a client application could make a Schema API change and then index some 
> documents based on the new schema that may fail on other nodes.
> Possible fixes:
> 1) Make the Schema API calls synchronous
> 2) Give the client some ability to track the state of the schema.  They can 
> already do this to a certain extent by checking the Schema API on all the 
> replicas and verifying that the field has been added, though this is pretty 
> cumbersome.  Maybe it makes more sense to do this sort of thing on the 
> collection level, i.e. Schema API changes return the zk version to the 
> client.  We add an API to return the current zk version.  On a replica, if 
> the zk version is >= the version the client has, the client knows that 
> replica has at least seen the schema change.  We could also provide an API to 
> do the distribution and checking across the different replicas of the 
> collection so that clients don't need to do that themselves.
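The version check described in option 2 of the quoted description reduces to a 
one-liner once a collection-level API exists: the change is visible everywhere 
as soon as the slowest replica has caught up. A minimal sketch, where 
`fetch_version` is again an illustrative callable rather than a real Solr API:

```python
def schema_change_visible(replica_urls, client_version, fetch_version):
    """True once every replica's schema zk version has reached the version
    the client received from its Schema API change (zk version >= client's)."""
    return min(fetch_version(url) for url in replica_urls) >= client_version
```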



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
