[jira] [Updated] (SOLR-7869) Overseer does not handle BadVersionException correctly
[ https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-7869:
Attachment: SOLR-7869.patch

Here's a better fix which discards the ZkStateWriter on a BadVersionException and starts afresh. The previous approach didn't work when an external change was made to a state.json with no change to /clusterstate.json. Such changes could be detected and resolved inside ZkStateWriter, but that would make the class unnecessarily complex.

ZkStateWriter will put itself into an invalid state upon a BadVersionException and will disallow all future operations. Callers are expected to discard such an instance and create a fresh ZkStateWriter instance for future use.

I added two tests in ZkStateWriterTest which simulate an external change to /clusterstate.json and to a state.json, and assert that an IllegalStateException is thrown on any future invocation of enqueueUpdate or writePendingUpdates. I also added an Overseer test which asserts that the overseer can keep processing events after a BadVersionException (indirectly testing that a fresh ZkStateWriter is created upon said exception).

I also added copious amounts of javadocs to the ZkStateWriter class for future reference.
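The discard-and-recreate contract described in this comment can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from the patch: `VersionedStore`, `StateWriter`, and `VersionConflictException` are hypothetical stand-ins (an in-memory versioned value in place of a ZooKeeper node, and an unchecked exception in place of ZooKeeper's checked BadVersionException). Only the method names enqueueUpdate and writePendingUpdates are borrowed from the comment.

```java
import java.util.ArrayList;
import java.util.List;

final class PoisonedWriterSketch {

    /** Hypothetical stand-in for BadVersionException; unchecked to keep the sketch short. */
    static final class VersionConflictException extends RuntimeException {
        VersionConflictException(String message) { super(message); }
    }

    /** Hypothetical in-memory stand-in for a versioned node such as /clusterstate.json. */
    static final class VersionedStore {
        private String data = "";
        private int version = 0;

        synchronized int version() { return version; }
        synchronized String data() { return data; }

        /** Compare-and-set: succeeds only if the caller's expected version matches. */
        synchronized int setData(String newData, int expectedVersion) {
            if (expectedVersion != version) {
                throw new VersionConflictException(
                    "expected version " + expectedVersion + " but found " + version);
            }
            data = newData;
            version++;
            return version;
        }
    }

    /** Writer that refuses all further work once it has seen a version conflict. */
    static final class StateWriter {
        private final VersionedStore store;
        private final List<String> pending = new ArrayList<>();
        private int lastVersion;
        private boolean invalid;

        StateWriter(VersionedStore store) {
            this.store = store;
            this.lastVersion = store.version(); // a fresh instance re-reads the current version
        }

        void enqueueUpdate(String update) {
            checkValid();
            pending.add(update);
        }

        void writePendingUpdates() {
            checkValid();
            if (pending.isEmpty()) {
                return;
            }
            try {
                lastVersion = store.setData(String.join(",", pending), lastVersion);
                pending.clear();
            } catch (VersionConflictException e) {
                invalid = true; // cached state can no longer be trusted
                throw e;
            }
        }

        private void checkValid() {
            if (invalid) {
                throw new IllegalStateException(
                    "writer saw a version conflict; discard it and create a new instance");
            }
        }
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        StateWriter writer = new StateWriter(store);

        store.setData("external edit", store.version()); // simulated external change

        writer.enqueueUpdate("replica down");
        try {
            writer.writePendingUpdates();
        } catch (VersionConflictException e) {
            // Caller's side of the contract: drop the old writer and start afresh.
            writer = new StateWriter(store);
            writer.enqueueUpdate("replica down");
            writer.writePendingUpdates();
        }
        System.out.println(store.data() + " @ version " + store.version());
    }
}
```

The point of the design is that the writer never tries to reconcile its cached state with the external change; the caller's catch block is the only recovery path, which keeps the writer itself simple.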
Overseer does not handle BadVersionException correctly
--

Key: SOLR-7869
URL: https://issues.apache.org/jira/browse/SOLR-7869
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 5.2.1
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Labels: difficulty-medium, impact-low
Fix For: Trunk, 5.4
Attachments: SOLR-7869.patch, SOLR-7869.patch, SOLR-7869.patch

If the /clusterstate.json is modified externally then the Overseer can go into an infinite loop upon a BadVersionException alternately trying to execute main queue and then the work queue:

{code}
ERROR - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer work queue loop
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
    at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
    at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
    at java.lang.Thread.run(Thread.java:745)
INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = { operation:state, state:down, base_url:http://127.0.1.1:7574/solr;, core:test_shard1_replica1, roles:null, node_name:127.0.1.1:7574_solr, shard:null, collection:test, core_node_name:core_node1} current state version: 9
INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null message={ operation:state, state:down, base_url:http://127.0.1.1:7574/solr;, core:test_shard1_replica1, roles:null, node_name:127.0.1.1:7574_solr, shard:null, collection:test, core_node_name:core_node1}
INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already registered
ERROR - 2015-08-04 18:49:56.225; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer main queue loop
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
    at
[jira] [Updated] (SOLR-7869) Overseer does not handle BadVersionException correctly
[ https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-7869:
Attachment: SOLR-7869.patch

Thanks for the review Scott! Both of your comments are now incorporated into this patch. I'll run precommit and tests and commit once they succeed.

Overseer does not handle BadVersionException correctly
--

Key: SOLR-7869
URL: https://issues.apache.org/jira/browse/SOLR-7869
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 5.2.1
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Labels: difficulty-medium, impact-low
Fix For: Trunk, 5.4
Attachments: SOLR-7869.patch, SOLR-7869.patch, SOLR-7869.patch, SOLR-7869.patch

If the /clusterstate.json is modified externally then the Overseer can go into an infinite loop upon a BadVersionException alternately trying to execute main queue and then the work queue:

{code}
ERROR - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer work queue loop
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
    at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
    at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
    at java.lang.Thread.run(Thread.java:745)
INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = { operation:state, state:down, base_url:http://127.0.1.1:7574/solr;, core:test_shard1_replica1, roles:null, node_name:127.0.1.1:7574_solr, shard:null, collection:test, core_node_name:core_node1} current state version: 9
INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null message={ operation:state, state:down, base_url:http://127.0.1.1:7574/solr;, core:test_shard1_replica1, roles:null, node_name:127.0.1.1:7574_solr, shard:null, collection:test, core_node_name:core_node1}
INFO - 2015-08-04 18:49:56.224; [ ] org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already registered
ERROR - 2015-08-04 18:49:56.225; [ ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer main queue loop
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
    at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
    at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
    at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
    at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
    at java.lang.Thread.run(Thread.java:745)
INFO - 2015-08-04 18:49:56.225; [ ] org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted to ver 8
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail:
[jira] [Updated] (SOLR-7869) Overseer does not handle BadVersionException correctly
[ https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-7869:
Fix Version/s: (was: 5.3) 5.4

Overseer does not handle BadVersionException correctly
--

Key: SOLR-7869
URL: https://issues.apache.org/jira/browse/SOLR-7869
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 5.2.1
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Labels: difficulty-medium, impact-low
Fix For: Trunk, 5.4
Attachments: SOLR-7869.patch
[jira] [Updated] (SOLR-7869) Overseer does not handle BadVersionException correctly
[ https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-7869:
Attachment: SOLR-7869.patch

Test + fix. I tried to reproduce this in OverseerTest but it proved too difficult: the randomized test I had would reproduce the failure only about once in 5 runs, so I went back to the test Scott had written and augmented it.

# I wonder if it is better to assert that the given cluster state's version is greater than that of ZkStateWriter's internal cluster state, instead of blindly using the given cluster state whenever the versions are not equal.
# I also wonder if a better fix is to re-create the ZkStateWriter object entirely if refreshClusterState is true in the Overseer. The reason: a user might modify a collection's state.json directly without modifying /clusterstate.json. In that case, our current fix won't work.

Overseer does not handle BadVersionException correctly
--

Key: SOLR-7869
URL: https://issues.apache.org/jira/browse/SOLR-7869
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 5.2.1
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Labels: difficulty-medium, impact-low
Fix For: Trunk, 5.4
Attachments: SOLR-7869.patch, SOLR-7869.patch
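The first question raised in this message (assert that the given cluster state is newer rather than adopting it whenever the versions differ) can be illustrated with a small sketch. Everything here is hypothetical and invented for illustration: `Snapshot`, `choose`, and the class name are not part of Solr, and the real ZkStateWriter may track versions differently.

```java
final class ClusterStateAdoptionSketch {

    /** Hypothetical immutable snapshot of a cluster state with its version. */
    static final class Snapshot {
        final int version;
        final String data;

        Snapshot(int version, String data) {
            this.version = version;
            this.data = data;
        }
    }

    /**
     * Decides which snapshot the writer should continue with.
     * Equal versions: keep the internal one (it may hold unflushed changes).
     * Newer given version: adopt it.
     * Older given version: fail loudly instead of silently going backwards.
     */
    static Snapshot choose(Snapshot internal, Snapshot given) {
        if (given.version == internal.version) {
            return internal;
        }
        if (given.version < internal.version) {
            throw new IllegalStateException(
                "given version " + given.version
                    + " is older than internal version " + internal.version);
        }
        return given;
    }

    public static void main(String[] args) {
        Snapshot internal = new Snapshot(9, "internal");
        System.out.println(choose(internal, new Snapshot(9, "same")).data);
        System.out.println(choose(internal, new Snapshot(10, "newer")).data);
        try {
            choose(internal, new Snapshot(8, "older"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

As the second question in the message points out, a check like this only sees the one version it is given; it cannot notice an external edit to a collection's state.json that leaves /clusterstate.json untouched.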
[jira] [Updated] (SOLR-7869) Overseer does not handle BadVersionException correctly
[ https://issues.apache.org/jira/browse/SOLR-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-7869:
Attachment: SOLR-7869.patch

Attached a TEST ONLY that repros the failure. This is not a fix.

Overseer does not handle BadVersionException correctly
--

Key: SOLR-7869
URL: https://issues.apache.org/jira/browse/SOLR-7869
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 5.2.1
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Labels: difficulty-medium, impact-low
Fix For: 5.3, Trunk
Attachments: SOLR-7869.patch