[ 
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680558#comment-14680558
 ] 

Scott Blum commented on SOLR-7869:
----------------------------------

1) bq. "Actually Overseer already has code that catches all exceptions and 
refreshes the cluster state."

In that case, I would suggest adding an explicit catch block there for 
BadVersionException that logs something milder than an error.  The current 
KeeperException catch block always logs an error.
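
To make that concrete, here is a rough sketch of the catch ordering I have in 
mind.  It is not the actual Overseer code: the exception classes are 
hypothetical stand-ins for KeeperException and its BadVersionException 
subclass, the ZkWrite interface and attempt() method are invented for 
illustration, and it uses java.util.logging only so the sketch stays 
self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; the exception types below are stand-ins, not ZooKeeper classes. */
public class CatchOrderSketch {
    private static final Logger LOG = Logger.getLogger("OverseerSketch");

    /** Stand-in for org.apache.zookeeper.KeeperException. */
    public static class KeeperLikeException extends Exception {
        public KeeperLikeException(String msg) { super(msg); }
    }

    /** Stand-in for KeeperException.BadVersionException (a subclass of the above). */
    public static class BadVersionLikeException extends KeeperLikeException {
        public BadVersionLikeException(String msg) { super(msg); }
    }

    /** A write that may fail the way a versioned ZooKeeper write can. */
    public interface ZkWrite {
        void run() throws KeeperLikeException;
    }

    /** Runs the write and returns the level it logged at, or Level.OFF if nothing was logged. */
    public static Level attempt(ZkWrite write) {
        try {
            write.run();
            return Level.OFF;
        } catch (BadVersionLikeException e) {
            // Expected when something else updated the node: log mildly, then refresh and retry.
            LOG.log(Level.INFO, "Cluster state changed underneath us; will refresh: {0}", e.getMessage());
            return Level.INFO;
        } catch (KeeperLikeException e) {
            // Every other ZooKeeper failure is still treated as a real error.
            LOG.log(Level.SEVERE, "Exception in Overseer queue loop", e);
            return Level.SEVERE;
        }
    }
}
```

The more specific catch has to come first, since the compiler rejects a 
subclass catch placed after its superclass.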

2) Okay, I totally buy that argument!  But see note below.

3) I'm merely saying that the refactoring that would be necessary to make 
ZkStateWriter more easily testable might negatively impact the flow of the 
existing code.  I'm not sure; you'd have to try it.


Having stewed on this a bit, I think there's something of a fundamental 
disconnect in how ZkStateWriter's batching interacts with the Overseer work 
queue.

Work items get removed from the queue before the corresponding change is 
actually written out to the cluster state.  You can see in the original design 
that there's this neat peek->poll dance in the Overseer loop that attempts to 
enforce guaranteed state updates by not discarding the work item until after 
the state is written out.  But the batching implementation drops this 
guarantee, and that's why I believe we're now in a state where Overseer 
updates and external updates can conflict with each other.

Imagine (as a thought experiment) we got rid of batching.  If we did that, no 
external change could "lose" a work item, because we'd be committing one item 
at a time, so the retry operation on a bad version exception would always 
re-grab the most recent, unapplied work item.  Now obviously, we don't want to 
get rid of batching, for efficiency reasons.  But I really do think we're batching 
in the wrong place.  I think batching actually needs to happen in Overseer, 
because it has to be tied to discarding work items.  Ideally, we'd peek N work 
items from the head of the queue, set up all the pending updates / cluster state 
mods in ZkStateWriter, then try to commit everything.  If it succeeds, great, 
remove all the processed items from the queue.  If it fails, then reread 
cluster state and retry all the items again.
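
Roughly, the loop I'm picturing looks like the sketch below.  Everything in it 
is hypothetical: VersionedState stands in for a versioned /clusterstate.json 
node with a compare-and-set write, StaleVersionException stands in for 
BadVersionException, and processBatch() is not an existing Overseer or 
ZkStateWriter method.  The only point is that items leave the queue after the 
commit succeeds, never before.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of batching tied to queue removal; none of these names are Solr APIs. */
public class BatchedQueueSketch {

    /** Stand-in for ZooKeeper's BadVersionException. */
    public static class StaleVersionException extends Exception {}

    /** Stand-in for a versioned cluster state node. */
    public static class VersionedState {
        public int version = 0;
        public final List<String> applied = new ArrayList<>();

        public synchronized int readVersion() { return version; }

        /** Compare-and-set write: succeeds only if the caller saw the current version. */
        public synchronized void commit(int expectedVersion, List<String> batch)
                throws StaleVersionException {
            if (expectedVersion != version) throw new StaleVersionException();
            applied.addAll(batch);
            version++;
        }

        /** Simulates a change made by someone other than the Overseer. */
        public synchronized void externalWrite(String change) {
            applied.add(change);
            version++;
        }
    }

    /**
     * Peeks up to batchSize items, tries to commit them as one write, and removes
     * them from the queue only after the commit succeeds. Returns the number of
     * items removed; 0 means the items are all still queued.
     */
    public static int processBatch(Deque<String> queue, VersionedState state,
                                   int cachedVersion, int batchSize, int maxAttempts) {
        // Peek: copy items from the head of the queue without removing them.
        List<String> batch = new ArrayList<>();
        Iterator<String> it = queue.iterator();
        while (it.hasNext() && batch.size() < batchSize) batch.add(it.next());
        if (batch.isEmpty()) return 0;

        int version = cachedVersion;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                state.commit(version, batch);
                // Only now is it safe to discard the work items.
                for (int i = 0; i < batch.size(); i++) queue.poll();
                return batch.size();
            } catch (StaleVersionException e) {
                // Someone else wrote first: re-read and retry the same items.
                version = state.readVersion();
            }
        }
        return 0;
    }
}
```

If the cached version is stale because of an external write, the first commit 
fails, the version is re-read, and the same batch is retried, so no work item 
is lost along the way.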

Make any sense?

(Now, all that said, that work may be more effort than it's worth, and maybe we 
should just focus on not having to use the queue to make cluster state updates 
in the format v2 world.)


> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
>                 Key: SOLR-7869
>                 URL: https://issues.apache.org/jira/browse/SOLR-7869
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2.1
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-low
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7869.patch, SOLR-7869.patch
>
>
> If the /clusterstate.json is modified externally then the Overseer can go 
> into an infinite loop upon a BadVersionException alternately trying to 
> execute main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"} current state version: 9
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null 
> message={
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"}
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already 
> registered
> ERROR - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted 
> to ver 8
> {code}


