[ 
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679667#comment-14679667
 ] 

Shalin Shekhar Mangar commented on SOLR-7869:
---------------------------------------------

bq. 1) So this doesn't actually fix anything yet, because there are no changes 
to Overseer itself? Presumably you'd need to catch BVE in overseer and 
force-refresh reader clusterState?

Actually Overseer already has code that catches all exceptions and refreshes 
the cluster state. The bug is in ZkStateWriter which does not use the refreshed 
cluster state if maybeRefreshBefore returns true and hence tries to 
compare-and-set using the outdated cluster state version.

bq. 2) Just noting that this seems the opposite of what we discussed earlier. I 
interpreted your earlier comments to mean that we should blow away the ZK data 
in favor of the overseer data, since overseer is authoritative. This patch 
seems do the opposite, preferring external user changes. To wit "it is 
guaranteed that overwriting cluster state with prevState will not discard any 
updates that Overseer had performed unless such an act was done externally by 
the user".

Maybe I wasn't clear enough. But I did mean the opposite of what you 
understood. The user has made some changes either accidentally in which case 
they totally deserve what's coming :) or presumably to fix something that went 
wrong in the cluster state (which could be because of a genuine bug). In both 
cases, overwriting stuff that a user has himself done seems wrong to me. We 
should just roll with it. Therefore the overseer refreshes the cluster state 
and starts using it as the base for future operations.

bq. 3) In ZkStateWriterTest, I note that ZkStateWriter isn't super amenable to 
testing, it's kind of subtle that enqueuing an update sometimes causes a flush, 
and sometimes does. Dunno if it's better or worse to have test-visible methods 
for doing a queue-without-flush and then explicit flush.

You are right. The testZkStateWriterBatching is a horrible test and I should 
have written a better one. In particular, maybeFlushAfter also updates the 
local state (lastStateFormat, lastCollectionName) before the write happens. We 
should change that. But I am not sure what you mean by "Dunno if it's better or 
worse to have test-visible methods for doing a queue-without-flush and then 
explicit flush."?

Re: #4 and #5 -- good point. I'll fix that.


> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
>                 Key: SOLR-7869
>                 URL: https://issues.apache.org/jira/browse/SOLR-7869
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2.1
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-low
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7869.patch, SOLR-7869.patch
>
>
> If the /clusterstate.json is modified externally then the Overseer can go 
> into an infinite loop upon a BadVersionException alternately trying to 
> execute main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr";,
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"} current state version: 9
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null 
> message={
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr";,
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"}
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already 
> registered
> ERROR - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted 
> to ver 8
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to