[ 
https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-7869:
----------------------------------------
    Attachment: SOLR-7869.patch

Here's a better fix which discards the ZkStateWriter on a BadVersionException 
and starts afresh. The previous approach didn't work when an external change 
was made to a state.json with no changes to /clusterstate.json. Such changes 
could be detected and resolved inside ZkStateWriter, but that would make the 
class unnecessarily complex.

ZkStateWriter will put itself into an invalid state upon a BadVersionException 
and will disallow all future operations. Callers are expected to discard such 
an instance and create a fresh ZkStateWriter instance for future use.
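
The fail-fast contract described above can be illustrated with a minimal, 
self-contained sketch. This is not the code in the patch: VersionedStore, 
VersionConflictException and PoisonableWriter are hypothetical names that 
stand in for a znode, ZooKeeper's BadVersionException and ZkStateWriter. The 
conflict exception is unchecked here only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ZooKeeper's BadVersionException (unchecked for brevity).
class VersionConflictException extends RuntimeException {
    VersionConflictException(String msg) { super(msg); }
}

// Hypothetical versioned store, standing in for a znode such as /clusterstate.json.
class VersionedStore {
    private String data = "";
    private int version = 0;

    synchronized int getVersion() { return version; }
    synchronized String getData() { return data; }

    // Conditional write: succeeds only if the expected version matches the stored one.
    synchronized int setData(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            throw new VersionConflictException(
                "expected version " + expectedVersion + " but was " + version);
        }
        data = newData;
        return ++version;
    }
}

// Sketch of a writer that marks itself invalid after a version conflict.
class PoisonableWriter {
    private final VersionedStore store;
    private final List<String> pending = new ArrayList<>();
    private int knownVersion;
    private boolean invalid = false;

    PoisonableWriter(VersionedStore store) {
        this.store = store;
        this.knownVersion = store.getVersion();
    }

    void enqueueUpdate(String update) {
        checkValid();
        pending.add(update);
    }

    void writePendingUpdates() {
        checkValid();
        if (pending.isEmpty()) {
            return;
        }
        try {
            knownVersion = store.setData(String.join(",", pending), knownVersion);
            pending.clear();
        } catch (VersionConflictException e) {
            // The cached version is stale; refuse all further use of this instance.
            invalid = true;
            throw e;
        }
    }

    private void checkValid() {
        if (invalid) {
            throw new IllegalStateException("writer is invalid; create a new instance");
        }
    }
}
```

A caller that catches the conflict is expected to drop its reference and 
construct a new PoisonableWriter against the same store, which re-reads the 
current version before writing again.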

I added two tests in ZkStateWriterTest which simulate an external change to 
/clusterstate.json and to a state.json, and assert that an 
IllegalStateException is thrown on any future invocation of enqueueUpdate or 
writePendingUpdates.

I also added a test in Overseer which asserts that the overseer can keep 
processing events on a BadVersionException (indirectly testing that a fresh 
ZkStateWriter is created upon said exception).

I also added copious amounts of javadocs to the ZkStateWriter class for future 
reference.

> Overseer does not handle BadVersionException correctly
> ------------------------------------------------------
>
>                 Key: SOLR-7869
>                 URL: https://issues.apache.org/jira/browse/SOLR-7869
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2.1
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-low
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7869.patch, SOLR-7869.patch, SOLR-7869.patch
>
>
> If /clusterstate.json is modified externally, the Overseer can go into an 
> infinite loop upon a BadVersionException, alternately trying to execute the 
> main queue and then the work queue:
> {code}
> ERROR - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> work queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: 
> queueSize: 1, message = {
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"} current state version: 9
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null 
> message={
>   "operation":"state",
>   "state":"down",
>   "base_url":"http://127.0.1.1:7574/solr",
>   "core":"test_shard1_replica1",
>   "roles":null,
>   "node_name":"127.0.1.1:7574_solr",
>   "shard":null,
>   "collection":"test",
>   "core_node_name":"core_node1"}
> INFO  - 2015-08-04 18:49:56.224; [   ] 
> org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already 
> registered
> ERROR - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer 
> main queue loop
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /clusterstate.json
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
>         at 
> org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
>         at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
>         at 
> org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
>         at 
> org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
>         at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
>         at java.lang.Thread.run(Thread.java:745)
> INFO  - 2015-08-04 18:49:56.225; [   ] 
> org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted 
> to ver 8
> {code}


