[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040728#comment-14040728 ]

ASF subversion and git services commented on SOLR-6095:
-------------------------------------------------------

Commit 1604791 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1604791 ]
SOLR-6095 wait for http responses

SolrCloud cluster can end up without an overseer with overseer roles
--------------------------------------------------------------------

                 Key: SOLR-6095
                 URL: https://issues.apache.org/jira/browse/SOLR-6095
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.8
            Reporter: Shalin Shekhar Mangar
            Assignee: Noble Paul
             Fix For: 5.0, 4.10
         Attachments: SOLR-6095.patch, SOLR-6095.patch, SOLR-6095.patch, SOLR-6095.patch

We have a large cluster running on EC2 which occasionally ends up without an overseer after a rolling restart. We always restart our overseer nodes last of all; otherwise we end up with a large number of shards that can't recover properly. This cluster is running a custom branch forked from 4.8 and has SOLR-5473, SOLR-5495 and SOLR-5468 applied. We have a large number of small collections (120 collections, each with approx. 5M docs) on 16 Solr nodes. We are also using the overseer roles feature to designate two specified nodes as overseers. However, I think the problem we're seeing is not specific to the overseer roles feature.
As soon as the overseer was shut down, we saw the following on the node which was next in line to become the overseer:

{code}
2014-05-20 09:55:39,261 [main-EventThread] INFO solr.cloud.ElectionContext - I am going to be the leader ec2-xx.compute-1.amazonaws.com:8987_solr
2014-05-20 09:55:39,265 [main-EventThread] WARN solr.cloud.LeaderElector - org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /overseer_elect/leader
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
	at org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
	at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
	at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
	at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
	at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
{code}

When the overseer leader node is gracefully shut down, we get the following in the logs:

{code}
2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer - Exception in Overseer main queue loop
org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:503)
	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
	... 4 more
2014-05-20 09:55:39,254 [Thread-63] INFO solr.cloud.Overseer - Overseer Loop exiting
{code}
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040730#comment-14040730 ]

ASF subversion and git services commented on SOLR-6095:
-------------------------------------------------------

Commit 1604792 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1604792 ]
SOLR-6095 wait for http responses
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035515#comment-14035515 ]

ASF subversion and git services commented on SOLR-6095:
-------------------------------------------------------

Commit 1603382 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1603382 ]
SOLR-6095 SolrCloud cluster can end up without an overseer with overseer roles
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035520#comment-14035520 ]

ASF subversion and git services commented on SOLR-6095:
-------------------------------------------------------

Commit 1603383 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1603383 ]
SOLR-6095 SolrCloud cluster can end up without an overseer with overseer roles
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035785#comment-14035785 ]

ASF subversion and git services commented on SOLR-6095:
-------------------------------------------------------

Commit 1603467 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1603467 ]
SOLR-6095 Uncaught Exception causing test failures
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035787#comment-14035787 ]

ASF subversion and git services commented on SOLR-6095:
-------------------------------------------------------

Commit 1603468 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1603468 ]
SOLR-6095 Uncaught Exception causing test failures
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033734#comment-14033734 ]

Shalin Shekhar Mangar commented on SOLR-6095:
---------------------------------------------

bq. RollingRestartTest.regularRestartTest() is commented out. If it's not required, you might want to remove it (or uncomment it and let it run).

Yes, it is not required in its current form. We can remove it.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032386#comment-14032386 ]

Noble Paul commented on SOLR-6095:
----------------------------------

I have tweaked the roles feature a bit, as follows.

The new approach: suppose the current order is (everyone below is watching the node right above it)

# nodeA-0 leader
# nodeB-1
# nodeC-2
# nodeD-3
# nodeE-4

and addrole asks *nodeD* to become overseer. Under the new approach, a command is sent to nodeD to rejoin the election at the head, so the new queue becomes

# nodeA-0 leader
# nodeB-1, nodeD-1
# nodeC-2
# nodeE-4

Now both nodeB and nodeD are waiting on *nodeA* to become the leader. The next step is to send a rejoin (not at head) command to *nodeB*, so the new order automatically becomes the following, where nodeD is next in line to become the leader:

# nodeA-0 leader
# nodeD-1
# nodeC-2
# nodeE-4
# nodeB-5

The final step is to send a quit command to nodeA (the current leader), so the order becomes

# nodeD-1 leader
# nodeC-2
# nodeE-4
# nodeB-5
# nodeA-6

So we have promoted *nodeD* to leader with just 3 operations. The advantage is that, irrespective of the number of nodes in the queue, the number of operations is always the same (3), so it does not matter whether the cluster is big or small. The good thing is there will never be a loss of overseer, even if the designate does not become the leader (because of errors happening in prioritizeOverseerNodes).
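The three-operation promotion described above can be sketched as a queue transformation. This is a minimal simulation of the end-to-end effect (not Solr's actual implementation, which issues rejoin/quit commands over the cluster); `promote_designate` is a hypothetical helper, and it assumes the designate is neither the current leader nor already second in line:

```python
# Hypothetical simulation of the 3-operation overseer promotion.
# The election queue is modelled as a list of node names; index 0 is the leader.

def promote_designate(queue, designate):
    """Return the queue after the three operations described above."""
    leader = queue[0]
    second = queue[1]
    rest = [n for n in queue if n not in (leader, second, designate)]
    # Op 1: designate rejoins at the head (ties with the current second node).
    # Op 2: the old second rejoins at the tail, leaving designate next in line.
    # Op 3: the leader quits and rejoins at the tail; designate becomes leader.
    return [designate] + rest + [second, leader]

q = ["nodeA", "nodeB", "nodeC", "nodeD", "nodeE"]
print(promote_designate(q, "nodeD"))
# ['nodeD', 'nodeC', 'nodeE', 'nodeB', 'nodeA']  -- matches the worked example
```

Note how the cost is three commands regardless of queue length, which is the property the comment emphasizes.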
We always restart our overseer nodes at the very last otherwise we end up with a large number of shards that can't recover properly. This cluster is running a custom branch forked from 4.8 and has SOLR-5473, SOLR-5495 and SOLR-5468 applied. We have a large number of small collections (120 collections each with approx 5M docs) on 16 Solr nodes. We are also using the overseer roles feature to designate two specified nodes as overseers. However, I think the problem that we're seeing is not specific to the overseer roles feature. As soon as the overseer was shutdown, we saw the following on the node which was next in line to become the overseer: {code} 2014-05-20 09:55:39,261 [main-EventThread] INFO solr.cloud.ElectionContext - I am going to be the leader ec2-xx.compute-1.amazonaws.com:8987_solr 2014-05-20 09:55:39,265 [main-EventThread] WARN solr.cloud.LeaderElector - org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /overseer_elect/leader at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386) at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373) at org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551) at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142) at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110) at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55) at 
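The three operations above can be sketched as an in-memory simulation of the election queue. The method names (rejoinAtHead, rejoinAtEnd, quitLeader) and the list model are illustrative only; in Solr these are commands sent to the nodes, not local list operations:

```java
import java.util.*;

// Hypothetical sketch of the three-operation overseer promotion described
// above. Index 0 of the list is the leader; each entry is "name-seq".
public class OverseerPromotionSketch {
    static List<String> queue = new ArrayList<>(
        List.of("nodeA-0", "nodeB-1", "nodeC-2", "nodeD-3", "nodeE-4"));
    static int nextSeq = 5;

    // Step 1: the designate re-registers just behind the leader, sharing
    // the runner-up's sequence (hardcoded to 1 for this scenario).
    static void rejoinAtHead(String name) {
        removeByName(name);
        queue.add(1, name + "-1");
    }

    // Step 2: the old runner-up re-registers at the back of the queue.
    static void rejoinAtEnd(String name) {
        removeByName(name);
        queue.add(name + "-" + nextSeq++);
    }

    // Step 3: the current leader quits and rejoins at the back.
    static void quitLeader() {
        String leader = queue.remove(0).split("-")[0];
        queue.add(leader + "-" + nextSeq++);
    }

    static void removeByName(String name) {
        queue.removeIf(e -> e.startsWith(name + "-"));
    }

    public static void main(String[] args) {
        rejoinAtHead("nodeD"); // nodeD now also waits on nodeA
        rejoinAtEnd("nodeB");  // nodeB moves to the back
        quitLeader();          // nodeA quits; nodeD becomes leader
        System.out.println(queue.get(0)); // prints "nodeD-1"
    }
}
```

Note that the operation count is constant: none of the three steps touches the nodes between the designate and the tail, which is why cluster size does not matter.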
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032616#comment-14032616 ]

Jessica Cheng commented on SOLR-6095:
-------------------------------------

What if nodeA dies before step 2? Is there a possibility that we'd end up with two overseers (nodeB and nodeD)? What is done to prevent this from happening?
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032619#comment-14032619 ]

Noble Paul commented on SOLR-6095:
----------------------------------

bq. Is there a possibility that we'd end up with two Overseers (nodeB and nodeD)?

No, only one can succeed. If nodeD succeeds, great. If it does not:
* nodeB will become the overseer,
* nodeD will rejoin at the back, and
* nodeB will go through all the same steps as explained above.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032659#comment-14032659 ]

Jessica Cheng commented on SOLR-6095:
-------------------------------------

{quote}
nodeD will rejoin at the back because the leader node already exists (created by nodeB)
{quote}

When does this happen? The classic ZooKeeper leader-election recipe would not have checked for the leader node. In LeaderElector.checkIfIamLeader, the node with the smallest seqId deletes the leader node without looking at it before writing itself down as the leader. If the first node that wrote itself down as the leader has already passed the amILeader() check in the Overseer loop before the second node overwrites it, it is possible for the first node to stay active for at least one loop iteration while the second node becomes the new leader. Secondly, if both of these nodes have already reached this loop, when one of them does exit the Overseer loop, when does it rejoin the election? (I don't see any code in the Overseer loop that rejoins the election.)

{quote}
and nodeB will go through all the same steps as explained above
{quote}

Even if what you describe above worked, if nodeB gets re-prioritized down and then nodeC becomes the leader, we still don't have the right result. What happens then?
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032807#comment-14032807 ]

Jessica Cheng commented on SOLR-6095:
-------------------------------------

I see that your new patch tries to fix the seq == intSeqs.get(0) case in LeaderElector, but the fix doesn't quite work. Note that the delete statement is meant to delete the old leader's node in case it hasn't expired yet, which is a possible scenario. If the old leader's node indeed hasn't expired, both nodeB and nodeD will fail your new check.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032814#comment-14032814 ]

Jessica Cheng commented on SOLR-6095:
-------------------------------------

Sorry, that wasn't true: you were comparing the election path, not the leader node. However, this still may not work, because sortSeqs extracts just the sequence number (n_01) out of the entire node string and sorts based on that. The sort order of nodeB and nodeD might therefore not be deterministic across JVMs, which makes the new if statement non-deterministic as well.
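The concern above can be demonstrated directly: a stable sort keyed only on the extracted sequence number preserves the input order of tied entries, so the "winner" depends on whatever order getChildren happened to return. The node-name format below is illustrative, not the exact Solr election-node format:

```java
import java.util.*;

// Sketch: sorting election nodes by the extracted sequence number alone is
// not a total order when two nodes share a sequence (as nodeB and nodeD do
// after a rejoin-at-head). List.sort is stable, so tied entries keep their
// input order, and that input order comes from ZooKeeper's getChildren(),
// which is unspecified.
public class SeqSortSketch {
    // take the digits after the final "n_" in a name like "sessB-nodeB-n_0000000001"
    static int seq(String node) {
        return Integer.parseInt(node.substring(node.lastIndexOf("n_") + 2));
    }

    static String winner(List<String> children) {
        List<String> copy = new ArrayList<>(children);
        copy.sort(Comparator.comparingInt(SeqSortSketch::seq));
        return copy.get(0); // "I am leader" if this is my own node
    }

    public static void main(String[] args) {
        String b = "sessB-nodeB-n_0000000001";
        String d = "sessD-nodeD-n_0000000001";
        // same two children, two different getChildren() orderings:
        System.out.println(winner(List.of(b, d))); // prints the nodeB entry
        System.out.println(winner(List.of(d, b))); // prints the nodeD entry
        // each node sorts its own view, so each can conclude it is first
    }
}
```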
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032836#comment-14032836 ]

Noble Paul commented on SOLR-6095:
----------------------------------

I haven't really gone into the implementation of Arrays.sort(), but as long as getChildren returns the nodes in the same order, Arrays.sort() would give the same order, right? (ZooKeeper itself does not sort based on the sequence number.)

But again, this solution does not give a 100% guarantee that nodeD becomes the leader if the last-step quit command is not executed. So there is a very small possibility that the overseer is not a designate, but there will always be a leader. The leader only changes if it quits because of an explicit rejoin CoreAdmin command, or if the node dies.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032868#comment-14032868 ]

Jessica Cheng commented on SOLR-6095:
-------------------------------------

The problem is that I don't think getChildren returns nodes in the same order. Its javadoc states:

{quote}
The list of children returned is not sorted and no guarantee is provided as to its natural or lexical order.
{quote}

If getChildren does not return nodes in the same order (and unless we can verify otherwise, and add a regression test against each ZooKeeper upgrade, the API doesn't guarantee it), the sort can produce different orderings of nodeB and nodeD, so that each believes it is the top item in its own invocation, and we're back to the temporary two-overseer case (for one loop iteration).
org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778) at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553) at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246) at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226) at
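To illustrate the ordering concern above: since ZooKeeper's getChildren() makes no ordering guarantee, an election client has to sort the sequential znodes itself before deciding who is next in line. A minimal sketch of that client-side sort (hypothetical znode names; not Solr's actual LeaderElector code):

```java
import java.util.*;

// Sketch: ZooKeeper appends a 10-digit sequence number to ephemeral
// sequential znodes. getChildren() may return them in any order, so an
// election client must sort by that sequence itself; only then do all
// clients agree on who is at the head of the election queue.
public class ElectionOrder {
    // Extract the sequence suffix, e.g. "nodeB-n_0000000007" -> 7.
    static int seq(String znode) {
        return Integer.parseInt(znode.substring(znode.lastIndexOf('_') + 1));
    }

    public static void main(String[] args) {
        // getChildren() might hand these back in any order.
        List<String> children = new ArrayList<>(Arrays.asList(
                "nodeD-n_0000000009", "nodeB-n_0000000007", "nodeA-n_0000000012"));
        children.sort(Comparator.comparingInt(ElectionOrder::seq));
        // After sorting, every client picks the same head of the queue.
        System.out.println(children.get(0));
    }
}
```

Without the sort, two clients seeing different orderings could each conclude they are first, which is exactly the temporary two-Overseer case described above.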
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032923#comment-14032923 ] Jessica Cheng commented on SOLR-6095: - Just checked ZooKeeper's code. A node's children are held in a HashSet in DataNode, which means that if you hit different ZooKeeper instances in the ensemble, you may get the results back in different orders.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033451#comment-14033451 ] Anshum Gupta commented on SOLR-6095: - RollingRestartTest.regularRestartTest() is commented out. If it's not required, you might want to remove it (or uncomment it and let it run).
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014535#comment-14014535 ] Shalin Shekhar Mangar commented on SOLR-6095: - Except we don't do our rolling restarts like that. Our restart script iterates through hosts looked up using the EC2 APIs (which almost always return the node names in the same order) and restarts them one by one: after each restart it waits for 60 seconds, verifies that the node is up again, and continues with the next host. Since the script originally created the nodes in the same order, the election nodes are also approximately in that order. This causes each host restart to displace the overseer to the next host in line, which is then displaced in turn, and so on.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014780#comment-14014780 ] Ramkumar Aiyengar commented on SOLR-6095: - That would explain it. Our start script blocks until all cores are active, hence we don't hit this issue.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014137#comment-14014137 ] Ramkumar Aiyengar commented on SOLR-6095: - Not sure I understand. You bring down the first wave, and the overseers move to the second wave. When you bring back the first wave, those nodes use the overseer in the second wave to recover and become active. Then you start on the second wave. Why would this be a problem?
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012802#comment-14012802 ] Mark Miller commented on SOLR-6095: - {quote}We always restart our overseer nodes at the very last otherwise we end up with a large number of shards that can't recover properly.{quote} Do you know if there is a JIRA issue for that?
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013306#comment-14013306 ] Shalin Shekhar Mangar commented on SOLR-6095: - No, I don't think there's a JIRA for it. The reason we could find was that if the rolling restart sequence happens to match the overseer election sequence, the overseer keeps shifting with each bounce and is unable to process events. This is tolerable in small clusters, but in large clusters, by the time the rolling restart completes, some nodes reach the recovery_failed state and won't try to come back up again. Once we changed our restart sequence to restart the overseer node last, we did not encounter this problem any more.
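The displacement Shalin describes can be shown with a toy simulation (hypothetical host names; not Solr code): if hosts are restarted in the same order as the election queue, each restart hands the overseer role to the next host, which is then restarted itself, so the overseer keeps moving for the entire rolling restart.

```java
import java.util.*;

// Toy model of an overseer election queue during a rolling restart whose
// order matches the election order. Each restart removes the current head
// (the overseer), promotes the next host, and re-enqueues the restarted
// host at the back, so the role is displaced on every single bounce.
public class RollingRestartSim {
    public static void main(String[] args) {
        Deque<String> election = new ArrayDeque<>(
                Arrays.asList("host1", "host2", "host3", "host4"));
        for (String host : List.of("host1", "host2", "host3", "host4")) {
            // Restart order matches election order, so the host being
            // restarted is always the current overseer (head of the queue).
            election.pollFirst();
            System.out.println("restarted " + host
                    + ", overseer is now " + election.peekFirst());
            // The restarted host rejoins the election at the back.
            election.addLast(host);
        }
    }
}
```

Four restarts displace the overseer four times; restarting the overseer node last (or blocking until cores are active, as in the comment above) breaks this lockstep.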
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003046#comment-14003046 ] Shalin Shekhar Mangar commented on SOLR-6095: - I also opened SOLR-6091 but that didn't help.
[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer
[ https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003585#comment-14003585 ] Shalin Shekhar Mangar commented on SOLR-6095: - The problem that I could find is in LeaderElector.checkIfIamLeader, where we have the following code:
{code}
if (seq <= intSeqs.get(0)) {
  // first we delete the node advertising the old leader in case the ephem is still there
  try {
    zkClient.delete(context.leaderPath, -1, true);
  } catch (Exception e) {
    // fine
  }
  runIamLeaderProcess(context, replacement);
}
{code}
If, for whatever reason, the zkClient.delete was unsuccessful, we just ignore the failure and go ahead to runIamLeaderProcess(...), which leads to OverseerElectionContext.runLeaderProcess(...), where it tries to create the /overseer_elect/leader node:
{code}
zkClient.makePath(leaderPath, ZkStateReader.toJSON(myProps), CreateMode.EPHEMERAL, true);
{code}
This is where things go wrong. Because the /overseer_elect/leader node already exists, zkClient.makePath fails, and the node gives up because it thinks there is already a leader. It never tries to rejoin the election. Then, once the ephemeral /overseer_elect/leader node goes away (after the previous overseer leader exits), the cluster is left with no leader. Shouldn't the node next in line to become the leader retry, or rejoin the election, instead of giving up?
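The failure mode described above (one failed attempt to create the leader node, then permanent surrender) and the suggested alternative (rejoin the election and retry) can be sketched in isolation. This is a hypothetical simplification, not Solr's actual LeaderElector or ZooKeeper client: `ElectionSketch`, `FakeLeaderNode`, `tryCreate`, `oldLeaderGone`, `becomeLeaderNoRetry`, and `becomeLeaderWithRejoin` are all invented names, and the stale ephemeral leader node is modeled by a plain boolean flag instead of a real ZK znode.

```java
// Hypothetical sketch: when creating /overseer_elect/leader fails because the
// previous leader's ephemeral node is still present, rejoin the election and
// retry instead of giving up permanently.
import java.util.concurrent.atomic.AtomicBoolean;

public class ElectionSketch {
    /** Stand-in for the /overseer_elect/leader ephemeral node in ZK. */
    static class FakeLeaderNode {
        // true = a (possibly stale) leader node already exists
        private final AtomicBoolean exists = new AtomicBoolean(true);

        /** Mimics zkClient.makePath: succeeds only if no leader node exists. */
        boolean tryCreate() { return exists.compareAndSet(false, true); }

        /** Mimics the old leader's ephemeral node finally expiring. */
        void oldLeaderGone() { exists.set(false); }
    }

    /**
     * Current (buggy) behavior: a single attempt; on NodeExists the node
     * assumes there is already a leader and never rejoins the election.
     */
    static boolean becomeLeaderNoRetry(FakeLeaderNode node) {
        return node.tryCreate();
    }

    /**
     * Suggested behavior: on NodeExists, rejoin the election and try again
     * once the stale node goes away (bounded attempts for this demo; the
     * real fix would wait on a ZK watch / session expiry instead).
     */
    static boolean becomeLeaderWithRejoin(FakeLeaderNode node, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (node.tryCreate()) {
                return true;       // we now own the leader node
            }
            node.oldLeaderGone();  // stand-in for "wait until the ephem expires"
        }
        return false;
    }

    public static void main(String[] args) {
        // Old leader's ephemeral node is still there in both scenarios.
        System.out.println("no retry:    " + becomeLeaderNoRetry(new FakeLeaderNode()));
        System.out.println("with rejoin: " + becomeLeaderWithRejoin(new FakeLeaderNode(), 3));
    }
}
```

With the single-attempt strategy the node fails and stays failed even after the stale ephemeral node expires; with the rejoin loop the second attempt succeeds, which is the behavior the comment argues for.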