[ https://issues.apache.org/jira/browse/SOLR-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar resolved SOLR-6159.
-----------------------------------------
Resolution: Fixed
Fix Version/s: 4.10, 5.0
Thanks Steven!
> cancelElection fails on uninitialized ElectionContext
> -----------------------------------------------------
>
> Key: SOLR-6159
> URL: https://issues.apache.org/jira/browse/SOLR-6159
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.8.1
> Reporter: Steven Bower
> Assignee: Shalin Shekhar Mangar
> Priority: Critical
> Fix For: 5.0, 4.10
>
> Attachments: SOLR-6159.patch, SOLR-6159.patch
>
>
> I had a Solr collection that was effectively out of memory (no exception, just
> continuous 80-90 second full GCs). This is of course not a good state, but
> when you are in it, every time you come out of a GC your ZooKeeper session
> has expired, causing all kinds of havoc. I found a bug in the case where,
> if you get a ZooKeeper error during LeaderElector.setup(),
> LeaderElector.context gets set to a context that is not fully initialized
> (i.e. joinElection has not been called on it).
> Once this happens the node can no longer attempt to join elections, because
> every time the LeaderElector attempts to call cancelElection() on the
> previous ElectionContext, the call fails with "Path cannot be null" (see the
> logs below).
> Some logs are below, and I've attached a patch that does four things:
> * Moves the setting of LeaderElector.context in the setup call to the end of
> the call, so it is only set if the setup completes.
> * Adds a check to see if leaderSeqPath is null in
> ElectionContext.cancelElection.
> * Makes leaderSeqPath volatile, as it is being directly accessed by multiple
> threads.
> * Sets LeaderElector.context = null when joinElection fails.
> There may be other issues; the patch is focused on breaking the failure loop
> that occurs when initialization of the ElectionContext fails.
> {noformat}
> 2014-06-08 23:14:57.805 INFO ClientCnxn [main-SendThread(host1:1234)] -
> Opening socket connection to server host1/10.122.142.31:1234. Will not
> attempt to authenticate using SASL (unknown error)
> 2014-06-08 23:14:57.806 INFO ClientCnxn [main-SendThread(host1:1234)] -
> Socket connection established to host1/10.122.142.31:1234, initiating session
> 2014-06-08 23:14:57.810 INFO ClientCnxn [main-SendThread(host1:1234)] -
> Unable to reconnect to ZooKeeper service, session 0x2467d956c8d0446 has
> expired, closing socket connection
> 2014-06-08 23:14:57.816 INFO ConnectionManager [main-EventThread] - Watcher
> org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
> name:ZooKeeperConnection
> Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
> got event WatchedEvent state:Expired type:None path:null path:null type:None
> 2014-06-08 23:14:57.817 INFO ConnectionManager [main-EventThread] - Our
> previous ZooKeeper session was expired. Attempting to reconnect to recover
> relationship with ZooKeeper...
> 2014-06-08 23:14:57.817 INFO DefaultConnectionStrategy [main-EventThread] -
> Connection expired - starting a new one...
> 2014-06-08 23:14:57.817 INFO ZooKeeper [main-EventThread] - Initiating
> client connection,
> connectString=host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
> sessionTimeout=15000
> watcher=org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
> 2014-06-08 23:14:57.857 INFO ConnectionManager [main-EventThread] - Waiting
> for client to connect to ZooKeeper
> 2014-06-08 23:14:57.859 INFO ClientCnxn [main-SendThread(host4:1234)] -
> Opening socket connection to server host4/172.17.14.107:1234. Will not
> attempt to authenticate using SASL (unknown error)
> 2014-06-08 23:14:57.891 INFO ClientCnxn [main-SendThread(host4:1234)] -
> Socket connection established to host4/172.17.14.107:1234, initiating session
> 2014-06-08 23:14:57.906 INFO ClientCnxn [main-SendThread(host4:1234)] -
> Session establishment complete on server host4/172.17.14.107:1234, sessionid
> = 0x4467d8d79260486, negotiated timeout = 15000
> 2014-06-08 23:14:57.907 INFO ConnectionManager [main-EventThread] - Watcher
> org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
> name:ZooKeeperConnection
> Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
> got event WatchedEvent state:SyncConnected type:None path:null path:null
> type:None
> 2014-06-08 23:14:57.909 INFO ConnectionManager [main-EventThread] - Client
> is connected to ZooKeeper
> 2014-06-08 23:14:57.909 INFO ConnectionManager [main-EventThread] -
> Connection with ZooKeeper reestablished.
> 2014-06-08 23:14:57.911 ERROR ZkController [Thread-203] -
> :org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/election
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> at
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
> at
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
> at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
> at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
> at
> org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2014-06-08 23:14:57.911 WARN ConnectionManager [Thread-203] - Exception
> running onReconnect command
> org.apache.solr.common.cloud.ZooKeeperException:
> at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
> at
> org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/election
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> at
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
> at
> org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
> at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
> at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
> ... 1 more
> 2014-06-08 23:14:57.917 INFO DefaultConnectionStrategy [main-EventThread] -
> Reconnected to ZooKeeper
> 2014-06-08 23:14:57.917 INFO ConnectionManager [main-EventThread] -
> Connected:true
> 2014-06-08 23:14:57.917 INFO ClientCnxn [main-EventThread] - EventThread
> shut down
> 2014-06-08 23:14:57.917 INFO ZkController [Thread-205] - publishing
> core=XXXXXX state=down collection=XXXXXX
> 2014-06-08 23:14:57.920 INFO ZkController [Thread-205] - numShards not found
> on descriptor - reading it from system property
> 2014-06-08 23:14:58.033 INFO ElectionContext [Thread-205] - canceling
> election null
> 2014-06-08 23:14:58.083 ERROR ZkController [Thread-205] -
> :java.lang.IllegalArgumentException: Path cannot be null
> at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
> at
> org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
> at
> org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
> at
> org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
> at
> org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
> at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
> at
> org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2014-06-08 23:14:58.083 WARN ConnectionManager [Thread-205] - Exception
> running onReconnect command
> org.apache.solr.common.cloud.ZooKeeperException:
> at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
> at
> org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> Caused by: java.lang.IllegalArgumentException: Path cannot be null
> at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
> at
> org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
> at
> org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
> at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
> at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
> at
> org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
> at
> org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
> at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
> ... 1 more
> {noformat}
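
For readers who want to see the shape of the fix described in the quoted report, here is a minimal, self-contained sketch of the four changes. It is not the attached patch and does not use the real Solr or ZooKeeper classes: ElectionSketch, FakeZk, Context, Elector, onReconnect, createSequential and the sequence-node name are all hypothetical stand-ins invented for illustration. Only the field and method names leaderSeqPath, context, setup and cancelElection come from the report.

```java
import java.util.ArrayList;
import java.util.List;

final class ElectionSketch {

    /** Hypothetical stand-in for the ZooKeeper client; not the real API. */
    static final class FakeZk {
        final List<String> deleted = new ArrayList<>();
        boolean failSetup;
        boolean failJoin;

        void ensureExists(String path) {
            if (failSetup) throw new IllegalStateException("Session expired for " + path);
        }

        String createSequential(String parent) {
            if (failJoin) throw new IllegalStateException("Session expired for " + parent);
            return parent + "/n_0000000001"; // made-up node name
        }

        void delete(String path) {
            // Mirrors the "Path cannot be null" failure seen in the logs.
            if (path == null) throw new IllegalArgumentException("Path cannot be null");
            deleted.add(path);
        }
    }

    /** Hypothetical stand-in for ElectionContext. */
    static final class Context {
        final String electionPath;
        // volatile: written by the joining thread, read by the cancelling thread.
        volatile String leaderSeqPath;

        Context(String electionPath) { this.electionPath = electionPath; }

        void cancelElection(FakeZk zk) {
            String path = leaderSeqPath;  // read the volatile field once
            if (path == null) return;     // never joined: nothing to delete
            zk.delete(path);
            leaderSeqPath = null;
        }
    }

    /** Hypothetical stand-in for LeaderElector. */
    static final class Elector {
        final FakeZk zk;
        Context context;

        Elector(FakeZk zk) { this.zk = zk; }

        void setup(Context candidate) {
            zk.ensureExists(candidate.electionPath); // may throw
            context = candidate; // assigned last, so only after setup succeeded
        }

        void onReconnect(Context fresh) {
            Context previous = context;
            if (previous != null) previous.cancelElection(zk);
            setup(fresh);
            try {
                fresh.leaderSeqPath = zk.createSequential(fresh.electionPath);
            } catch (RuntimeException e) {
                context = null; // do not keep a context that never joined
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        Elector elector = new Elector(zk);
        zk.failSetup = true;
        try {
            elector.onReconnect(new Context("/overseer_elect/election"));
        } catch (IllegalStateException e) {
            System.out.println("setup failed: " + e.getMessage());
        }
        zk.failSetup = false;
        elector.onReconnect(new Context("/overseer_elect/election")); // recovers
        System.out.println("joined as " + elector.context.leaderSeqPath);
    }
}
```

The point of the sketch is that a failed setup or join never leaves behind a half-initialized context, and that cancelling a context that never joined is a no-op, so the next reconnect can proceed instead of failing again.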
--
This message was sent by Atlassian JIRA
(v6.2#6252)