[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles

2014-06-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040728#comment-14040728
 ] 

ASF subversion and git services commented on SOLR-6095:
---

Commit 1604791 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1604791 ]

SOLR-6095 wait for http responses

 SolrCloud cluster can end up without an overseer with overseer roles 
 -

 Key: SOLR-6095
 URL: https://issues.apache.org/jira/browse/SOLR-6095
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.8
Reporter: Shalin Shekhar Mangar
Assignee: Noble Paul
 Fix For: 5.0, 4.10

 Attachments: SOLR-6095.patch, SOLR-6095.patch, SOLR-6095.patch, 
 SOLR-6095.patch


 We have a large cluster running on EC2 which occasionally ends up without an 
 overseer after a rolling restart. We always restart our overseer nodes last; 
 otherwise we end up with a large number of shards that can't recover properly.
 This cluster is running a custom branch forked from 4.8 with SOLR-5473, 
 SOLR-5495 and SOLR-5468 applied. We have a large number of small collections 
 (120 collections, each with approx. 5M docs) on 16 Solr nodes. We are also 
 using the overseer roles feature to designate two specific nodes as 
 overseers. However, I think the problem we're seeing is not specific to 
 the overseer roles feature.
 As soon as the overseer was shut down, we saw the following on the node which 
 was next in line to become the overseer:
 {code}
 2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  - I am going to be the leader ec2-xx.compute-1.amazonaws.com:8987_solr
 2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /overseer_elect/leader
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
   at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
   at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
   at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
   at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
   at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
   at org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
   at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
   at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
   at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
   at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 {code}
 When the overseer leader node is gracefully shut down, we get the following in 
 the logs:
 {code}
 2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in Overseer main queue loop
 org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
   at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
   at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
   at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
   at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.InterruptedException
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:503)
   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
   at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
   at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
   at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
   at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
   at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
   ... 4 more
 2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer Loop exiting :
 {code}
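The NodeExists failure in the first log excerpt is essentially a leader-election race: the next candidate tries to create the `/overseer_elect/leader` znode while the previous leader's ephemeral node has not yet been reaped, and the election must retry rather than give up. The following is a minimal, self-contained sketch of that retry contract; it is an in-memory simulation for illustration only, not Solr's or ZooKeeper's actual API (the class and names are hypothetical).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory sketch (NOT Solr's code) of the race described above: a candidate
// tries to claim /overseer_elect/leader, and on "NodeExists" it must retry
// until the old leader's stale node disappears, instead of abandoning the
// election (the failure mode reported in this issue).
public class ElectionSketch {
    // Stand-in for the ZooKeeper namespace: path -> owning session.
    static final Map<String, String> znodes = new ConcurrentHashMap<>();

    // Returns true once this session owns the leader node.
    static boolean tryBecomeLeader(String session, int maxRetries) throws InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            // putIfAbsent mimics ZooKeeper.create(): it fails if the node exists.
            String existing = znodes.putIfAbsent("/overseer_elect/leader", session);
            if (existing == null || existing.equals(session)) {
                return true; // created (or already own) the leader node
            }
            // NodeExists: the previous leader's ephemeral node is still there.
            // A real client would set a watch on it; here we back off and retry.
            Thread.sleep(10);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        znodes.put("/overseer_elect/leader", "old-leader"); // stale ephemeral node
        // Simulate the old session expiring shortly after shutdown.
        new Thread(() -> {
            try { Thread.sleep(30); } catch (InterruptedException ignored) {}
            znodes.remove("/overseer_elect/leader", "old-leader");
        }).start();
        boolean elected = tryBecomeLeader("node-next-in-line", 20);
        System.out.println(elected ? "elected" : "gave up");
    }
}
```

The point of the sketch is the loop: treating NodeExists as a transient condition to wait out, which is the behavior the patches on this issue aim to guarantee.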

[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles

2014-06-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040730#comment-14040730
 ] 

ASF subversion and git services commented on SOLR-6095:
---

Commit 1604792 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1604792 ]

SOLR-6095 wait for http responses


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles

2014-06-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035515#comment-14035515
 ] 

ASF subversion and git services commented on SOLR-6095:
---

Commit 1603382 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1603382 ]

SOLR-6095 SolrCloud cluster can end up without an overseer with overseer roles


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles

2014-06-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035520#comment-14035520
 ] 

ASF subversion and git services commented on SOLR-6095:
---

Commit 1603383 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1603383 ]

SOLR-6095 SolrCloud cluster can end up without an overseer with overseer roles


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles

2014-06-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035785#comment-14035785
 ] 

ASF subversion and git services commented on SOLR-6095:
---

Commit 1603467 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1603467 ]

SOLR-6095 Uncaught Exception causing test failures


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer with overseer roles

2014-06-18 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035787#comment-14035787
 ] 

ASF subversion and git services commented on SOLR-6095:
---

Commit 1603468 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1603468 ]

SOLR-6095 Uncaught Exception causing test failures

[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-17 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033734#comment-14033734
 ] 

Shalin Shekhar Mangar commented on SOLR-6095:
-

bq. RollingRestartTest.regularRestartTest() is commented out. If it’s not 
required, you might want to remove it (or uncomment it and let it run).

Yes, it is not required in its current form. We can remove it.

 SolrCloud cluster can end up without an overseer
 

 Key: SOLR-6095
 URL: https://issues.apache.org/jira/browse/SOLR-6095
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.8
Reporter: Shalin Shekhar Mangar
Assignee: Noble Paul
 Fix For: 4.9, 5.0

 Attachments: SOLR-6095.patch, SOLR-6095.patch, SOLR-6095.patch, 
 SOLR-6095.patch


 We have a large cluster running on ec2 which occasionally ends up without an 
 overseer after a rolling restart. We always restart our overseer nodes at the 
 very last otherwise we end up with a large number of shards that can't 
 recover properly.
 This cluster is running a custom branch forked from 4.8 and has SOLR-5473, 
 SOLR-5495 and SOLR-5468 applied. We have a large number of small collections 
 (120 collections each with approx 5M docs) on 16 Solr nodes. We are also 
 using the overseer roles feature to designate two specified nodes as 
 overseers. However, I think the problem that we're seeing is not specific to 
 the overseer roles feature.
 As soon as the overseer was shutdown, we saw the following on the node which 
 was next in line to become the overseer:
 {code}
 2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  
 - I am going to be the leader ec2-xx.compute-1.amazonaws.com:8987_solr
 2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - 
 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
 NodeExists for /overseer_elect/leader
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
   at 
 org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
   at 
 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
   at 
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
   at 
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
   at 
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
   at 
 org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
   at 
 org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
   at 
 org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
   at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
   at 
 org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
   at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 {code}
 When the overseer leader node is gracefully shutdown, we get the following in 
 the logs:
 {code}
 2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in 
 Overseer main queue loop
 org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
   at 
 org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
   at 
 org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
   at 
 org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
   at 
 org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.InterruptedException
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:503)
   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
   at 
 org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
   at 
 org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
   at 
 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
   at 
 org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
   at 
 org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
   ... 4 more
 2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer 
 Loop exiting : ec2-xx.compute-1.amazonaws.com:8986_solr

[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032386#comment-14032386
 ] 

Noble Paul commented on SOLR-6095:
--

I have tweaked the roles feature a bit, as follows.

The new approach:

Suppose the current order is as below (each node watches the one immediately
above it):

# nodeA-0 leader
# nodeB-1
# nodeC-2
# nodeD-3
# nodeE-4

and addrole asks *nodeD* to become the overseer.

Under the new approach, a command is sent to nodeD to rejoin the election at
the head, so the new queue becomes:

# nodeA-0 leader
# nodeB-1  nodeD-1
# nodeC-2
# nodeE-4

Now both nodeB and nodeD are watching *nodeA*, waiting to become the leader.

The next step is to send a rejoin (not at head) command to *nodeB*, so the
order automatically becomes the following, with nodeD next in line to become
the leader:

# nodeA-0 leader
# nodeD-1
# nodeC-2
# nodeE-4
# nodeB-5


The final step is to send a quit command to nodeA (the current leader), so the
new order becomes:


# nodeD-1 leader
# nodeC-2
# nodeE-4
# nodeB-5
# nodeA-6 

So we have promoted *nodeD* to leader with just three operations. The
advantage is that, irrespective of the number of nodes in the queue, the
number of operations stays the same (three), so it does not matter whether the
cluster is big or small. The other good thing is that there is never a loss of
overseer, even if the designate does not become the leader (because of errors
happening in prioritizeOverseerNodes).
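The three-step promotion can be sketched with a plain in-memory queue (a simulation only, not Solr's actual ZooKeeper code; the {{promote}} helper is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class OverseerPromotion {
    // Index 0 is the leader; each node watches the entry just above it.
    static List<String> promote(List<String> queue, String designate) {
        // Step 1: the designate rejoins the election at the head,
        // i.e. right behind the current leader.
        queue.remove(designate);
        queue.add(1, designate);
        // Step 2: the node that was previously second in line rejoins at
        // the back, leaving the designate alone behind the leader.
        String second = queue.get(2);
        queue.remove(second);
        queue.add(second);
        // Step 3: the current leader quits; it rejoins at the back and the
        // designate takes over.
        String leader = queue.remove(0);
        queue.add(leader);
        return queue;
    }

    public static void main(String[] args) {
        List<String> q = new ArrayList<>(List.of("nodeA", "nodeB", "nodeC", "nodeD", "nodeE"));
        System.out.println(promote(q, "nodeD")); // nodeD is now at the head
    }
}
```

Note that only three election commands are issued regardless of queue length, which is the cluster-size-independence claimed above.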
 


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032616#comment-14032616
 ] 

Jessica Cheng commented on SOLR-6095:
-

What if nodeA dies before step 2? Is there a possibility that we'd end up with
two Overseers (nodeB and nodeD)? What's done to prevent this from happening?


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032619#comment-14032619
 ] 

Noble Paul commented on SOLR-6095:
--

bq. Is there a possibility that we'd end up with two Overseers (nodeB and nodeD)?

No, only one can succeed. If nodeD succeeds, great. If it does not:
 * nodeB will become Overseer
 * nodeD will rejoin at the back 
 * and nodeB will go through all the same steps as explained above


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032659#comment-14032659
 ] 

Jessica Cheng commented on SOLR-6095:
-

{quote}
nodeD will rejoin at the back because the leader node already exists 
(created by nodeB)
{quote}
When does this happen? The classic ZK leader election recipe would not have 
checked for the leader node. In LeaderElector.checkIfIAmLeader, the node with 
the smallest seqId deletes the leader node without looking at it before writing 
itself down as the leader. If the first node that wrote itself down as the 
leader already passed the amILeader() check in the Overseer loop before the 
second node overwrites it, it is then possible that the first node will be 
active for at least one loop iteration while the second node becomes the new 
leader. Secondly, if both of these nodes have already reached this loop, when 
one of them exits the Overseer loop, when does it rejoin the election? (I don't 
see any code in the Overseer loop that rejoins the election.)
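The race described here can be simulated without a real ZooKeeper, using a map as a stand-in for the /overseer_elect/leader znode (an illustrative sketch only, not the actual SolrZkClient code path):

```java
import java.util.HashMap;
import java.util.Map;

public class LeaderRace {
    // Stand-in for ZooKeeper: path -> data.
    static final Map<String, String> zk = new HashMap<>();

    // Mimics the delete-then-create in checkIfIamLeader: the candidate
    // deletes the leader node without inspecting it, then writes itself.
    static void claimLeadership(String node) {
        zk.remove("/overseer_elect/leader");
        zk.put("/overseer_elect/leader", node);
    }

    static String race() {
        claimLeadership("nodeB");
        // nodeB passes its amILeader() check before nodeD acts...
        boolean nodeBPassedCheck = "nodeB".equals(zk.get("/overseer_elect/leader"));
        claimLeadership("nodeD"); // ...then nodeD blindly overwrites the node.
        // For at least one loop iteration, nodeB still believes it is the
        // Overseer while nodeD holds the leader node.
        return nodeBPassedCheck + "," + zk.get("/overseer_elect/leader");
    }

    public static void main(String[] args) {
        System.out.println(race()); // true,nodeD
    }
}
```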

{quote}
and nodeB will go through all the same steps as explained above
{quote}
Even if, say, what you describe above worked: here nodeB gets re-prioritized 
down, and if nodeC then becomes the leader, we still don't have the right 
result. What happens then?


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032807#comment-14032807
 ] 

Jessica Cheng commented on SOLR-6095:
-

I see that your new patch is trying to fix the seq = intSeqs.get(0) case in 
LeaderElector, but the fix doesn't quite work. Note that the delete statement 
is meant to delete the old leader's node in case it hasn't expired yet, which 
is a possible scenario. If the old leader's node indeed hasn't expired, both 
nodeB and nodeD will fail your new statement.


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032814#comment-14032814
 ] 

Jessica Cheng commented on SOLR-6095:
-

Sorry, that wasn't true. You were comparing the election path, not the leader 
node. However, this still possibly doesn't work, because sortSeqs extracts just 
the sequence number (n_01) out of the entire node string and sorts based on 
that, so the sort order of nodeB and nodeD might not be deterministic across 
JVMs, which makes this new if statement non-deterministic as well.
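The tie can be reproduced with a plain stable sort keyed only on the extracted sequence number (a sketch; the node names and the {{seq}} helper are illustrative, not Solr's actual parsing code):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SeqSortTie {
    // Extract the trailing ZooKeeper sequence number from an election node
    // name such as "nodeB-n_0000000001".
    static int seq(String node) {
        return Integer.parseInt(node.substring(node.lastIndexOf('_') + 1));
    }

    static String[] sortBySeq(String[] nodes) {
        // Arrays.sort on objects is stable, so nodes with EQUAL sequence
        // numbers keep whatever relative order getChildren() returned.
        Arrays.sort(nodes, Comparator.comparingInt(SeqSortTie::seq));
        return nodes;
    }

    public static void main(String[] args) {
        // nodeB and nodeD tie on sequence 1: the winner of the tie depends
        // entirely on the (unordered) getChildren() result.
        String[] a = sortBySeq(new String[]{"nodeB-n_0000000001", "nodeD-n_0000000001"});
        String[] b = sortBySeq(new String[]{"nodeD-n_0000000001", "nodeB-n_0000000001"});
        System.out.println(a[0].equals(b[0])); // false: the head differs with input order
    }
}
```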


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032836#comment-14032836
 ] 

Noble Paul commented on SOLR-6095:
--

I haven't really gone into the implementation of Arrays.sort(). But as long as 
getChildren returns the nodes in the same order, Arrays.sort() would give the 
same order, right? Because ZK does not sort based on the sequence number.

But again, this solution does not give a 100% guarantee that nodeD becomes the 
leader if the last-step quit command is not executed. So there is a very small 
possibility that the overseer is not a designate, but there will always be a 
leader; it changes only if the leader quits, because of an explicit rejoin 
core admin command, or if the node dies.


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032868#comment-14032868
 ] 

Jessica Cheng commented on SOLR-6095:
-

The problem is that I don't think getChildren returns nodes in the same 
order. Its javadoc states:

{quote}
The list of children returned is not sorted and no guarantee is provided as to 
its natural or lexical order.
{quote}

If getChildren doesn't return nodes in the same order (and unless we can 
verify otherwise, and add that as a regression test against each ZK upgrade, 
we can't rely on it, since the API doesn't guarantee it), the sort can 
produce different orderings of nodeB and nodeD, so that each believes it is 
the top item in its own invocation, and we're back to the temporary 
two-Overseer case (for one loop iteration).


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032923#comment-14032923
 ] 

Jessica Cheng commented on SOLR-6095:
-

Just checked ZooKeeper's code. A node's children are held in a HashSet in 
DataNode, which means that if you hit different ZooKeeper instances in the 
ensemble, you may get different orderings back.
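
To illustrate the point (a hypothetical demonstration, not Solr code): a HashSet imposes no useful ordering, so two servers can hand back the same children in different orders, and only an explicit canonical sort brings the two views back into agreement:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChildOrder {
    public static void main(String[] args) {
        // DataNode keeps children in a HashSet, so iteration order is
        // unspecified and may differ between ensemble members
        Set<String> children = new HashSet<>(
                List.of("n_0000000003", "n_0000000001", "n_0000000002"));

        List<String> serverA = new ArrayList<>(children);
        List<String> serverB = new ArrayList<>(children);
        Collections.shuffle(serverB);  // simulate another server's ordering

        // Without sorting, the two views may disagree; after a canonical
        // sort they always agree
        Collections.sort(serverA);
        Collections.sort(serverB);
        System.out.println(serverA.equals(serverB)); // prints "true"
    }
}
```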


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-06-16 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033451#comment-14033451
 ] 

Anshum Gupta commented on SOLR-6095:


RollingRestartTest.regularRestartTest() is commented out. If it’s not required, 
you might want to remove it (or uncomment it and let it run).


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-31 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014535#comment-14014535
 ] 

Shalin Shekhar Mangar commented on SOLR-6095:
-

Except we don't do our rolling restarts like that. Our restart script 
iterates through hosts looked up using the EC2 APIs (which almost always 
return the node names in the same order), restarting them one by one; after 
each restart it waits for 60 seconds, verifies that the node is up again, 
and continues with the next host.

Since the script originally created the nodes in the same order, the 
election nodes are also approximately in that order. This causes each host 
restart to displace the overseer to the next host in line, which is in turn 
displaced, and so on.
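
The wait-and-verify step of such a script can be sketched as a small polling helper; this is an illustrative sketch, and the HTTP ping endpoint mentioned in the comment is an assumption, not the actual script:

```java
import java.util.function.BooleanSupplier;

public class WaitUntilUp {
    // Poll `check` once per intervalMillis until it returns true,
    // giving up (returning false) after maxTries attempts.
    static boolean waitUntilUp(BooleanSupplier check, int maxTries,
                               long intervalMillis) throws InterruptedException {
        for (int i = 0; i < maxTries; i++) {
            if (check.getAsBoolean()) return true;
            Thread.sleep(intervalMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // In a real restart script the check would be an HTTP ping of the
        // node, e.g. GET http://host:8983/solr/admin/info/system (assumed
        // endpoint); here a trivial check stands in for it.
        System.out.println(waitUntilUp(() -> true, 60, 1000)); // prints "true"
    }
}
```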


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-31 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014780#comment-14014780
 ] 

Ramkumar Aiyengar commented on SOLR-6095:
-

That would explain it. Our start script blocks until all cores are active, 
hence we don't have this issue.


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-30 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014137#comment-14014137
 ] 

Ramkumar Aiyengar commented on SOLR-6095:
-

Not sure I understand. You bring down the first wave, and the overseers move 
to the second wave. When you bring back the first wave, they use the 
overseer in the second wave to recover and become active. Then you restart 
the second wave. Why would this be a problem?


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-29 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012802#comment-14012802
 ] 

Mark Miller commented on SOLR-6095:
---

{quote}We always restart our overseer nodes at the very last otherwise we end 
up with a large number of shards that can't recover properly.{quote}

Do you know if there is a JIRA issue for that?

 SolrCloud cluster can end up without an overseer
 

 Key: SOLR-6095
 URL: https://issues.apache.org/jira/browse/SOLR-6095
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.8
Reporter: Shalin Shekhar Mangar
Assignee: Noble Paul
 Fix For: 4.9, 5.0


 We have a large cluster running on ec2 which occasionally ends up without an 
 overseer after a rolling restart. We always restart our overseer nodes at the 
 very last otherwise we end up with a large number of shards that can't 
 recover properly.
 This cluster is running a custom branch forked from 4.8 and has SOLR-5473, 
 SOLR-5495 and SOLR-5468 applied. We have a large number of small collections 
 (120 collections each with approx 5M docs) on 16 Solr nodes. We are also 
 using the overseer roles feature to designate two specified nodes as 
 overseers. However, I think the problem that we're seeing is not specific to 
 the overseer roles feature.
 As soon as the overseer was shutdown, we saw the following on the node which 
 was next in line to become the overseer:
 {code}
 2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  
 - I am going to be the leader ec2-xx.compute-1.amazonaws.com:8987_solr
 2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - 
 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
 NodeExists for /overseer_elect/leader
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
   at 
 org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
   at 
 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
   at 
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
   at 
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
   at 
 org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
   at 
 org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
   at 
 org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
   at 
 org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
   at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
   at 
 org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
   at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 {code}
 When the overseer leader node is gracefully shutdown, we get the following in 
 the logs:
 {code}
 2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in 
 Overseer main queue loop
 org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
   at 
 org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
   at 
 org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
   at 
 org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
   at 
 org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.InterruptedException
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:503)
   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
   at 
 org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
   at 
 org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
   at 
 org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
   at 
 org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
   at 
 org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
   ... 4 more
 2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer 
 Loop exiting : ec2-xx.compute-1.amazonaws.com:8986_solr
 2014-05-20 09:55:39,256 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper watch triggered, but Solr cannot talk to ZK
 {code}

[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-29 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013306#comment-14013306
 ] 

Shalin Shekhar Mangar commented on SOLR-6095:
-

No, I don't think there's a jira for it. The cause we could find was that if the rolling restart sequence happens to match the overseer election sequence, the overseer role keeps shifting with each bounce and events never get processed. This is kinda okay in small clusters, but in large clusters, by the time the rolling restart completes, some nodes reach the recovery_failed state and won't try to come back up again.

Once we changed our restart sequence to restart the overseer node last, we did not encounter this problem anymore.
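For what it's worth, the "overseer last" ordering can be automated by asking the Collections API who the current overseer leader is before bouncing anything (the OVERSEERSTATUS action should be available on a 4.8-based branch). A rough sketch, assuming jq is installed; the node list and the restart_node helper are hypothetical placeholders for your own tooling:

```shell
#!/bin/sh
# Sketch: rolling restart that bounces the current overseer leader last.
NODES="host1:8983 host2:8983 host3:8983"   # hypothetical node list

restart_node() {                            # hypothetical helper: replace with
  echo "restarting $1"                      # your actual service restart
}

# OVERSEERSTATUS reports the leader as e.g. "host2:8983_solr"
LEADER=$(curl -s "http://host1:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json" \
  | jq -r '.leader')

for n in $NODES; do
  case "$LEADER" in "$n"*) continue ;; esac  # skip the overseer for now
  restart_node "$n"
done
restart_node "${LEADER%_solr}"               # overseer leader goes last
```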


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-20 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003046#comment-14003046
 ] 

Shalin Shekhar Mangar commented on SOLR-6095:
-

I also opened SOLR-6091 but that didn't help.


[jira] [Commented] (SOLR-6095) SolrCloud cluster can end up without an overseer

2014-05-20 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003585#comment-14003585
 ] 

Shalin Shekhar Mangar commented on SOLR-6095:
-

The problem that I could find is in LeaderElector.checkIfIamLeader, where we have the following code:
{code}
if (seq <= intSeqs.get(0)) {
  // first we delete the node advertising the old leader in case the ephem is still there
  try {
    zkClient.delete(context.leaderPath, -1, true);
  } catch (Exception e) {
    // fine
  }

  runIamLeaderProcess(context, replacement);
}
{code}

If, for whatever reason, the zkClient.delete is unsuccessful, we just ignore it and go ahead to runIamLeaderProcess(...), which leads to OverseerElectionContext.runLeaderProcess(...), where it tries to create the /overseer_elect/leader node:
{code}
zkClient.makePath(leaderPath, ZkStateReader.toJSON(myProps),
CreateMode.EPHEMERAL, true);
{code}
This is where things go wrong. Because the /overseer_elect/leader node still exists, zkClient.makePath fails, and the node gives up, thinking that there is already a leader. It never tries to rejoin the election. Then, once the ephemeral /overseer_elect/leader node goes away (after the previous overseer leader exits), the cluster is left with no overseer.

Shouldn't the node next in line to become a leader try again or rejoin the 
election instead of giving up?
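The retry-instead-of-give-up behavior suggested above can be sketched as follows. This is a hypothetical illustration, not the committed patch: ZkOps, tryBecomeLeader, and the retry/backoff parameters are invented stand-ins for the real SolrZkClient/LeaderElector code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the ZooKeeper client; only the two calls used below.
interface ZkOps {
    void delete(String path) throws Exception;
    void createEphemeral(String path) throws Exception;
}

class NodeExistsException extends Exception {}

public class LeaderRetrySketch {

    // Try to claim the leader path. Instead of assuming a NodeExists failure
    // means a live leader holds the node, retry: the stale ephemeral left by
    // the previous leader will expire, after which the create succeeds.
    static boolean tryBecomeLeader(ZkOps zk, String leaderPath, int attempts,
                                   long backoffMs) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            try {
                // Best-effort delete of the old leader's ephemeral, as in
                // checkIfIamLeader; failure here is non-fatal.
                try { zk.delete(leaderPath); } catch (Exception ignored) {}
                zk.createEphemeral(leaderPath);
                return true; // we are the leader now
            } catch (Exception e) {
                // NodeExists (or a transient error): back off and retry rather
                // than giving up the election permanently.
                Thread.sleep(backoffMs);
            }
        }
        return false; // caller should rejoin the election from scratch
    }

    public static void main(String[] args) throws Exception {
        // Fake client: the previous leader's ephemeral is "stuck" for the
        // first two create attempts, then expires.
        AtomicBoolean stale = new AtomicBoolean(true);
        int[] creates = {0};
        ZkOps zk = new ZkOps() {
            public void delete(String path) throws Exception {
                if (stale.get()) throw new Exception("delete failed");
            }
            public void createEphemeral(String path) throws Exception {
                if (++creates[0] < 3) throw new NodeExistsException();
                stale.set(false);
            }
        };
        boolean leader = tryBecomeLeader(zk, "/overseer_elect/leader", 5, 10);
        System.out.println("became leader: " + leader
                + " after " + creates[0] + " create attempts");
        // prints: became leader: true after 3 create attempts
    }
}
```

If the bounded retries are exhausted, returning false lets the caller re-enter the election from scratch rather than silently leaving the cluster without an overseer.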
