[
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679667#comment-14679667
]
Shalin Shekhar Mangar commented on SOLR-7869:
---------------------------------------------
bq. 1) So this doesn't actually fix anything yet, because there are no changes
to Overseer itself? Presumably you'd need to catch BVE in overseer and
force-refresh reader clusterState?
Actually, Overseer already has code that catches all exceptions and refreshes
the cluster state. The bug is in ZkStateWriter, which does not use the
refreshed cluster state when maybeRefreshBefore returns true and hence tries to
compare-and-set using the outdated cluster state version.
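To illustrate the failure mode, here is a minimal sketch. It is not the Solr
code: VersionedNode and SketchStateWriter are hypothetical in-memory stand-ins,
and the exception is unchecked only to keep the example short (ZooKeeper's
KeeperException is checked). A writer that keeps its cached version after an
external edit fails every compare-and-set until it picks up the refreshed
version.

```java
/** Hypothetical in-memory stand-in for a versioned ZooKeeper node; not Solr code. */
final class VersionedNode {
    /** Illustrative unchecked analogue of ZooKeeper's BadVersionException. */
    static final class BadVersionException extends RuntimeException {
        BadVersionException(int expected, int actual) {
            super("expected version " + expected + " but node is at " + actual);
        }
    }

    private String data = "{}";
    private int version = 0;

    synchronized int version() { return version; }
    synchronized String data() { return data; }

    /** Compare-and-set: succeeds only if the caller's version matches the node's. */
    synchronized int setData(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            throw new BadVersionException(expectedVersion, version);
        }
        data = newData;
        version++;
        return version;
    }
}

/** Hypothetical writer that caches the last version it knows about. */
final class SketchStateWriter {
    private final VersionedNode node;
    private int knownVersion;

    SketchStateWriter(VersionedNode node) {
        this.node = node;
        this.knownVersion = node.version();
    }

    /** Picks up the node's current version, e.g. after an external modification. */
    void refresh() {
        knownVersion = node.version();
    }

    /** Writes using the cached version; throws if that version is stale. */
    int write(String newData) {
        knownVersion = node.setData(newData, knownVersion);
        return knownVersion;
    }
}
```

If something else bumps the node's version, write() keeps throwing until
refresh() is called, which mirrors the loop in the issue description: the
refresh happens, but the write still uses the old version.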
bq. 2) Just noting that this seems the opposite of what we discussed earlier. I
interpreted your earlier comments to mean that we should blow away the ZK data
in favor of the overseer data, since overseer is authoritative. This patch
seems to do the opposite, preferring external user changes. To wit "it is
guaranteed that overwriting cluster state with prevState will not discard any
updates that Overseer had performed unless such an act was done externally by
the user".
Maybe I wasn't clear enough, but I meant the opposite of what you understood.
The user has changed the cluster state either accidentally, in which case they
totally deserve what's coming :), or presumably to fix something that went
wrong in the cluster state (which could be because of a genuine bug). In both
cases, overwriting changes that the user made themselves seems wrong to me. We
should just roll with it. Therefore the overseer refreshes the cluster state
and starts using it as the base for future operations.
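As a rough sketch of that "roll with it" policy (again not the actual patch;
SketchStore and RebaseOnConflict are hypothetical names, and the bounded retry
count is an assumption made for the example): on a version conflict the writer
adopts whatever is currently stored as its new base and reapplies its pending
update on top, so the external change survives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical versioned store standing in for /clusterstate.json; not Solr code. */
final class SketchStore {
    /** Immutable view of the state together with the version it was read at. */
    static final class Snapshot {
        final Map<String, String> state;
        final int version;

        Snapshot(Map<String, String> state, int version) {
            this.state = new HashMap<>(state);
            this.version = version;
        }
    }

    private Map<String, String> state = new HashMap<>();
    private int version = 0;

    synchronized Snapshot snapshot() {
        return new Snapshot(state, version);
    }

    synchronized int version() {
        return version;
    }

    /** Stores next and bumps the version only if expectedVersion is current. */
    synchronized boolean compareAndSet(Map<String, String> next, int expectedVersion) {
        if (expectedVersion != version) {
            return false;
        }
        state = new HashMap<>(next);
        version++;
        return true;
    }
}

final class RebaseOnConflict {
    /**
     * Applies update on top of the cached snapshot; on conflict, re-reads the
     * store and reapplies the update on top of the externally modified state.
     */
    static Map<String, String> apply(SketchStore store,
                                     SketchStore.Snapshot cached,
                                     UnaryOperator<Map<String, String>> update,
                                     int maxAttempts) {
        SketchStore.Snapshot base = cached;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Map<String, String> next = update.apply(new HashMap<>(base.state));
            if (store.compareAndSet(next, base.version)) {
                return next;
            }
            // Conflict: keep the external change and use it as the new base.
            base = store.snapshot();
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

The alternative policy, overwriting the stored data with the writer's own copy,
would amount to skipping the re-read and forcing the write, which is exactly
what would discard the user's edit.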
bq. 3) In ZkStateWriterTest, I note that ZkStateWriter isn't super amenable to
testing, it's kind of subtle that enqueuing an update sometimes causes a flush,
and sometimes doesn't. Dunno if it's better or worse to have test-visible methods
for doing a queue-without-flush and then explicit flush.
You are right. The testZkStateWriterBatching is a horrible test and I should
have written a better one. In particular, maybeFlushAfter also updates the
local state (lastStateFormat, lastCollectionName) before the write happens. We
should change that. But I am not sure what you mean by "Dunno if it's better or
worse to have test-visible methods for doing a queue-without-flush and then
explicit flush."?
Re: #4 and #5 -- good point. I'll fix that.
> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
> Key: SOLR-7869
> URL: https://issues.apache.org/jira/browse/SOLR-7869
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2.1
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Labels: difficulty-medium, impact-low
> Fix For: Trunk, 5.4
>
> Attachments: SOLR-7869.patch, SOLR-7869.patch
>
>
> If the /clusterstate.json is modified externally then the Overseer can go
> into an infinite loop upon a BadVersionException alternately trying to
> execute main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer
> work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /clusterstate.json
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
> at java.lang.Thread.run(Thread.java:745)
> INFO - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage:
> queueSize: 1, message = {
> "operation":"state",
> "state":"down",
> "base_url":"http://127.0.1.1:7574/solr",
> "core":"test_shard1_replica1",
> "roles":null,
> "node_name":"127.0.1.1:7574_solr",
> "shard":null,
> "collection":"test",
> "core_node_name":"core_node1"} current state version: 9
> INFO - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null
> message={
> "operation":"state",
> "state":"down",
> "base_url":"http://127.0.1.1:7574/solr",
> "core":"test_shard1_replica1",
> "roles":null,
> "node_name":"127.0.1.1:7574_solr",
> "shard":null,
> "collection":"test",
> "core_node_name":"core_node1"}
> INFO - 2015-08-04 18:49:56.224; [ ]
> org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already
> registered
> ERROR - 2015-08-04 18:49:56.225; [ ]
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer
> main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /clusterstate.json
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
> at
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
> at
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
> at
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
> at
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
> at java.lang.Thread.run(Thread.java:745)
> INFO - 2015-08-04 18:49:56.225; [ ]
> org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted
> to ver 8
> {code}