[ 
https://issues.apache.org/jira/browse/SOLR-6095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014780#comment-14014780
 ] 

Ramkumar Aiyengar commented on SOLR-6095:
-----------------------------------------

That would explain it, our start script blocks until all cores are active, 
hence we don't have this issue..

> SolrCloud cluster can end up without an overseer
> ------------------------------------------------
>
>                 Key: SOLR-6095
>                 URL: https://issues.apache.org/jira/browse/SOLR-6095
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.8
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Noble Paul
>             Fix For: 4.9, 5.0
>
>
> We have a large cluster running on ec2 which occasionally ends up without an 
> overseer after a rolling restart. We always restart our overseer nodes at the 
> very last otherwise we end up with a large number of shards that can't 
> recover properly.
> This cluster is running a custom branch forked from 4.8 and has SOLR-5473, 
> SOLR-5495 and SOLR-5468 applied. We have a large number of small collections 
> (120 collections each with approx 5M docs) on 16 Solr nodes. We are also 
> using the overseer roles feature to designate two specified nodes as 
> overseers. However, I think the problem that we're seeing is not specific to 
> the overseer roles feature.
> As soon as the overseer was shutdown, we saw the following on the node which 
> was next in line to become the overseer:
> {code}
> 2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  
> - I am going to be the leader ec2-xxxxxxxxxx.compute-1.amazonaws.com:8987_solr
> 2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /overseer_elect/leader
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
>       at 
> org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
>       at 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
>       at 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
>       at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
>       at 
> org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
>       at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> {code}
> When the overseer leader node is gracefully shutdown, we get the following in 
> the logs:
> {code}
> 2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in 
> Overseer main queue loop
> org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Object.wait(Object.java:503)
>       at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
>       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
>       ... 4 more
> 2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer 
> Loop exiting : ec2-xxxxxxxxxx.compute-1.amazonaws.com:8986_solr
> 2014-05-20 09:55:39,256 [main-EventThread] WARN  common.cloud.ZkStateReader  
> - ZooKeeper watch triggered, but Solr cannot talk to ZK
> 2014-05-20 09:55:39,259 [ShutdownMonitor] INFO  server.handler.ContextHandler 
>  - stopped 
> o.e.j.w.WebAppContext{/solr,file:/vol0/cloud86/solr-webapp/webapp/},/vol0/cloud86/webapps/solr.war
> {code}
> Notice how the overseer kept on running almost till the last point i.e. until 
> the jetty context stopped. On some runs, we got the following on the overseer 
> leader node on graceful shutdown:
> {code}
> 2014-05-19 21:33:43,657 [Thread-75] ERROR solr.cloud.Overseer  - Exception in 
> Overseer main queue loop
> org.apache.solr.common.SolrException: Could not load collection from ZK:sm71
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Object.wait(Object.java:503)
>       at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:769)
>       ... 4 more
> 2014-05-19 21:33:43,662 [main-EventThread] WARN  common.cloud.ZkStateReader  
> - ZooKeeper watch triggered, but Solr cannot talk to ZK
> 2014-05-19 21:33:43,663 [Thread-75] INFO  solr.cloud.Overseer  - Overseer 
> Loop exiting : ec2-xxxxxxxxxxxx.compute-1.amazonaws.com:8987_solr
> 2014-05-19 21:33:43,664 [OverseerExitThread] ERROR solr.cloud.Overseer  - 
> could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /overseer_elect/leader
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
>       at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>       at 
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:329)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:85)
>       at 
> org.apache.solr.cloud.Overseer$ClusterStateUpdater$1.run(Overseer.java:293)
> 2014-05-19 21:33:43,665 [ShutdownMonitor] INFO  server.handler.ContextHandler 
>  - stopped 
> o.e.j.w.WebAppContext{/solr,file:/vol0/cloud87/solr-webapp/webapp/},/vol0/cloud87/webapps/solr.war
> {code}
> Again the overseer was clinging on till the last moment and by the time it 
> exited the ZK session expired and it couldn't delete the 
> /overseer_elect/leader node. The exception on the next-in-line node was the 
> same i.e. NodeExists for /overseer_elect/leader.
> In both cases, we are left with no overseers after restart. I can easily 
> reproduce this problem by just restarting overseer leader nodes repeatedly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to