[
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680558#comment-14680558
]
Scott Blum commented on SOLR-7869:
----------------------------------
1) bq. "Actually Overseer already has code that catches all exceptions and
refreshes the cluster state."
In that case, I would suggest an explicit catch block there for
BadVersionException, and log something milder than error? The current
KeeperException catch block always logs an error.
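To make the suggestion concrete, here is a minimal sketch of the catch ordering, not the actual Overseer code. KeeperStub and BadVersionStub are hypothetical stand-ins for org.apache.zookeeper.KeeperException and its BadVersionException subclass so that the example compiles without ZooKeeper on the classpath; the log list and the refreshes counter are likewise assumptions standing in for the real logger and the cluster state refresh.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.apache.zookeeper.KeeperException.
class KeeperStub extends Exception {
    KeeperStub(String message) { super(message); }
}

// Hypothetical stand-in for KeeperException.BadVersionException.
class BadVersionStub extends KeeperStub {
    BadVersionStub(String path) { super("BadVersion for " + path); }
}

class OverseerLoopSketch {
    // One unit of queue processing that may fail with a ZooKeeper-style error.
    interface QueueStep {
        void run() throws KeeperStub;
    }

    final List<String> log = new ArrayList<>(); // stands in for the real logger
    int refreshes = 0;                          // stands in for refreshing cluster state

    void runStep(QueueStep step) {
        try {
            step.run();
        } catch (BadVersionStub e) {
            // The more specific catch must come first. A version conflict is
            // expected when the state is modified externally, so log it mildly.
            log.add("WARN " + e.getMessage());
            refreshes++;
        } catch (KeeperStub e) {
            // Every other ZooKeeper-style failure is still logged as an error.
            log.add("ERROR " + e.getMessage());
            refreshes++;
        }
    }
}
```

Both branches still refresh the state; only the log level differs.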
2) Okay, I totally buy that argument! But see note below.
3) I'm merely saying that the refactoring that would be necessary to make
ZkStateWriter more easily testable might negatively impact the flow of the
existing code. I'm not sure, you'd have to try it.
Having stewed on this a bit, I think there's somewhat of a fundamental
disconnect in how ZkStateWriter's batching interacts with the Overseer work
queue.
Work items get removed from the queue before the corresponding change is
actually written out to the cluster state. You can see in the original design
that there's this neat peek->poll dance in the Overseer loop that attempts to
enforce guaranteed state updates by not discarding the work item until after
the state is written out. But the batching implementation gets rid of this
guarantee, and I think that's why we're now in a state where Overseer
updates and external updates can conflict with each other.
Imagine (as a thought experiment) we got rid of batching. If we did that, no
external change could "lose" a work item, because we'd be committing one item
at a time, so the retry operation on a bad version exception would always
re-grab the most recent, unapplied work item. Now obviously, we don't want to
get rid of batching, for efficiency reasons. But I really do think we're batching
in the wrong place. I think batching actually needs to happen in Overseer,
because it has to be tied to discarding work items. Ideally, we'd peek N work
items from the head of the queue, set up all the pending updates / cluster state
mods in ZkStateWriter, then try to commit everything. If it succeeds, great,
remove all the processed items from the queue. If it fails, then reread
cluster state and retry all the items again.
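A rough sketch of that peek, commit, then discard flow, under stated assumptions: VersionedStore is a hypothetical in-memory stand-in for a versioned znode such as /clusterstate.json, StaleVersionException stands in for BadVersionException, and work items are plain strings appended to a list. None of these names exist in Solr; the point is only the ordering of the steps.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for ZooKeeper's BadVersionException. */
class StaleVersionException extends Exception {}

/** State data together with the version it was read at. */
final class Snapshot {
    final List<String> data;
    final int version;
    Snapshot(List<String> data, int version) { this.data = data; this.version = version; }
}

/** Hypothetical in-memory stand-in for a versioned znode. */
class VersionedStore {
    private List<String> data = new ArrayList<>();
    private int version = 0;

    Snapshot read() { return new Snapshot(new ArrayList<>(data), version); }

    /** Conditional write: succeeds only if nothing was written since expectedVersion. */
    void write(List<String> newData, int expectedVersion) throws StaleVersionException {
        if (expectedVersion != version) throw new StaleVersionException();
        data = new ArrayList<>(newData);
        version++;
    }

    /** Simulates an external modification that bypasses the work queue. */
    void externalWrite(String change) { data.add(change); version++; }
}

class BatchingSketch {
    /**
     * Peeks up to batchSize items, applies them on top of freshly read state,
     * and removes them from the queue only after the conditional write succeeds.
     * Returns the number of items committed, or 0 if the queue was empty or
     * every attempt conflicted (the items then remain at the head of the queue).
     */
    static int processBatch(Deque<String> queue, VersionedStore store,
                            int batchSize, int maxAttempts) {
        if (queue.isEmpty()) return 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Snapshot current = store.read();
            List<String> next = new ArrayList<>(current.data);
            int peeked = 0;
            Iterator<String> it = queue.iterator(); // iterating peeks without removing
            while (it.hasNext() && peeked < batchSize) {
                next.add(it.next());
                peeked++;
            }
            try {
                store.write(next, current.version);
            } catch (StaleVersionException e) {
                continue; // reread state and retry the same, still-queued items
            }
            for (int i = 0; i < peeked; i++) {
                queue.poll(); // discard work items only after the commit
            }
            return peeked;
        }
        return 0;
    }
}
```

Because nothing is polled until the write succeeds, an external write between the read and the commit only costs a retry; the batch is rebuilt on top of the external change rather than being lost.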
Make any sense?
(Now, all that said, that work may be more effort than it's worth, and maybe we
should just focus on not having to use the queue to make cluster state updates
in the format v2 world.)
> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
> Key: SOLR-7869
> URL: https://issues.apache.org/jira/browse/SOLR-7869
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2.1
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Labels: difficulty-medium, impact-low
> Fix For: Trunk, 5.4
>
> Attachments: SOLR-7869.patch, SOLR-7869.patch
>
>
> If the /clusterstate.json is modified externally then the Overseer can go
> into an infinite loop upon a BadVersionException alternately trying to
> execute main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>     at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>     at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>     at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
>     at java.lang.Thread.run(Thread.java:745)
> INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = {
> "operation":"state",
> "state":"down",
> "base_url":"http://127.0.1.1:7574/solr",
> "core":"test_shard1_replica1",
> "roles":null,
> "node_name":"127.0.1.1:7574_solr",
> "shard":null,
> "collection":"test",
> "core_node_name":"core_node1"} current state version: 9
> INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null message={
> "operation":"state",
> "state":"down",
> "base_url":"http://127.0.1.1:7574/solr",
> "core":"test_shard1_replica1",
> "roles":null,
> "node_name":"127.0.1.1:7574_solr",
> "shard":null,
> "collection":"test",
> "core_node_name":"core_node1"}
> INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already registered
> ERROR - 2015-08-04 18:49:56.225; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>     at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>     at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>     at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
>     at java.lang.Thread.run(Thread.java:745)
> INFO - 2015-08-04 18:49:56.225; [ ] org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted to ver 8
> {code}