Steven Bower created SOLR-6159:
----------------------------------
Summary: cancelElection fails on uninitialized ElectionContext
Key: SOLR-6159
URL: https://issues.apache.org/jira/browse/SOLR-6159
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.8.1
Reporter: Steven Bower
Priority: Critical
I had a solr collection that basically was out of memory (no exception, just
continuous 80-90 second full GCs). This of course is not a good state, but when
in this state ever time you come out of the GC your zookeeper session has
expired causing all kinds of havoc. Anyway I found bug in the condition where
during LeaderElector.setup() if you get a Zookeeper error the
LeaderElector.context get set to a context that is not fully initialized (ie
hasn't called joinElection..
Anyway once this happens the node can no longer attempt to join elections
because every time the LeaderElector attempts to call cancelElection() on the
previous ElectionContext..
Some logs below and I've attached a patch that does 2 things:
* Move the setting of LeaderElector.context in the setup call to then of the
call so it is only set if the setup completes.
* Added a check to see if leaderSeqPath is null in
ElectionContext.cancelElection
* Made leaderSeqPath volatile as it is being directly accessed by multiple
threads.
* set LeaderElector.context = null when joinElection fails
There may be other issues.. the patch is focused on breaking the failure loop
that occurs when initialization of the ElectionContext fails.
{noformat}
2014-06-08 23:14:57.805 INFO ClientCnxn [main-SendThread(host1:1234)] -
Opening socket connection to server host1/10.122.142.31:1234. Will not attempt
to authenticate using SASL (unknown error)
2014-06-08 23:14:57.806 INFO ClientCnxn [main-SendThread(host1:1234)] - Socket
connection established to host1/10.122.142.31:1234, initiating session
2014-06-08 23:14:57.810 INFO ClientCnxn [main-SendThread(host1:1234)] - Unable
to reconnect to ZooKeeper service, session 0x2467d956c8d0446 has expired,
closing socket connection
2014-06-08 23:14:57.816 INFO ConnectionManager [main-EventThread] - Watcher
org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
name:ZooKeeperConnection
Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
got event WatchedEvent state:Expired type:None path:null path:null type:None
2014-06-08 23:14:57.817 INFO ConnectionManager [main-EventThread] - Our
previous ZooKeeper session was expired. Attempting to reconnect to recover
relationship with ZooKeeper...
2014-06-08 23:14:57.817 INFO DefaultConnectionStrategy [main-EventThread] -
Connection expired - starting a new one...
2014-06-08 23:14:57.817 INFO ZooKeeper [main-EventThread] - Initiating client
connection,
connectString=host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
sessionTimeout=15000
watcher=org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
2014-06-08 23:14:57.857 INFO ConnectionManager [main-EventThread] - Waiting
for client to connect to ZooKeeper
2014-06-08 23:14:57.859 INFO ClientCnxn [main-SendThread(host4:1234)] -
Opening socket connection to server host4/172.17.14.107:1234. Will not attempt
to authenticate using SASL (unknown error)
2014-06-08 23:14:57.891 INFO ClientCnxn [main-SendThread(host4:1234)] - Socket
connection established to host4/172.17.14.107:1234, initiating session
2014-06-08 23:14:57.906 INFO ClientCnxn [main-SendThread(host4:1234)] -
Session establishment complete on server host4/172.17.14.107:1234, sessionid =
0x4467d8d79260486, negotiated timeout = 15000
2014-06-08 23:14:57.907 INFO ConnectionManager [main-EventThread] - Watcher
org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
name:ZooKeeperConnection
Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
got event WatchedEvent state:SyncConnected type:None path:null path:null
type:None
2014-06-08 23:14:57.909 INFO ConnectionManager [main-EventThread] - Client is
connected to ZooKeeper
2014-06-08 23:14:57.909 INFO ConnectionManager [main-EventThread] - Connection
with ZooKeeper reestablished.
2014-06-08 23:14:57.911 ERROR ZkController [Thread-203] -
:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
= Session expired for /overseer_elect/election
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
at
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
at
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
at
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
2014-06-08 23:14:57.911 WARN ConnectionManager [Thread-203] - Exception
running onReconnect command
org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
at
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/election
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
at
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
at
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
... 1 more
2014-06-08 23:14:57.917 INFO DefaultConnectionStrategy [main-EventThread] -
Reconnected to ZooKeeper
2014-06-08 23:14:57.917 INFO ConnectionManager [main-EventThread] -
Connected:true
2014-06-08 23:14:57.917 INFO ClientCnxn [main-EventThread] - EventThread shut
down
2014-06-08 23:14:57.917 INFO ZkController [Thread-205] - publishing
core=XXXXXX state=down collection=XXXXXX
2014-06-08 23:14:57.920 INFO ZkController [Thread-205] - numShards not found
on descriptor - reading it from system property
2014-06-08 23:14:58.033 INFO ElectionContext [Thread-205] - canceling election
null
2014-06-08 23:14:58.083 ERROR ZkController [Thread-205] -
:java.lang.IllegalArgumentException: Path cannot be null
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
at
org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
at
org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
at
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
2014-06-08 23:14:58.083 WARN ConnectionManager [Thread-205] - Exception
running onReconnect command
org.apache.solr.common.cloud.ZooKeeperException:
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
at
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
Caused by: java.lang.IllegalArgumentException: Path cannot be null
at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
at
org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
at
org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
... 1 more
{noformat}
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]