Steven Bower created SOLR-6159:
----------------------------------

             Summary: cancelElection fails on uninitialized ElectionContext
                 Key: SOLR-6159
                 URL: https://issues.apache.org/jira/browse/SOLR-6159
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.8.1
            Reporter: Steven Bower
            Priority: Critical


I had a solr collection that basically was out of memory (no exception, just 
continuous 80-90 second full GCs). This of course is not a good state, but when 
in this state ever time you come out of the GC your zookeeper session has 
expired causing all kinds of havoc. Anyway I found bug in the condition where 
during LeaderElector.setup() if you get a Zookeeper error the 
LeaderElector.context get set to a context that is not fully initialized (ie 
hasn't called joinElection..

Anyway once this happens the node can no longer attempt to join elections 
because every time the LeaderElector attempts to call cancelElection() on the 
previous ElectionContext..

Some logs below and I've attached a patch that does 2 things:

* Move the setting of LeaderElector.context in the setup call to then of the 
call so it is only set if the setup completes.
* Added a check to see if leaderSeqPath is null in 
ElectionContext.cancelElection
* Made leaderSeqPath volatile as it is being directly accessed by multiple 
threads.
* set LeaderElector.context = null when joinElection fails

There may be other issues.. the patch is focused on breaking the failure loop 
that occurs when initialization of the ElectionContext fails.

{noformat}
2014-06-08 23:14:57.805 INFO  ClientCnxn [main-SendThread(host1:1234)] - 
Opening socket connection to server host1/10.122.142.31:1234. Will not attempt 
to authenticate using SASL (unknown error)
2014-06-08 23:14:57.806 INFO  ClientCnxn [main-SendThread(host1:1234)] - Socket 
connection established to host1/10.122.142.31:1234, initiating session
2014-06-08 23:14:57.810 INFO  ClientCnxn [main-SendThread(host1:1234)] - Unable 
to reconnect to ZooKeeper service, session 0x2467d956c8d0446 has expired, 
closing socket connection
2014-06-08 23:14:57.816 INFO  ConnectionManager [main-EventThread] - Watcher 
org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 
name:ZooKeeperConnection 
Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
 got event WatchedEvent state:Expired type:None path:null path:null type:None
2014-06-08 23:14:57.817 INFO  ConnectionManager [main-EventThread] - Our 
previous ZooKeeper session was expired. Attempting to reconnect to recover 
relationship with ZooKeeper...
2014-06-08 23:14:57.817 INFO  DefaultConnectionStrategy [main-EventThread] - 
Connection expired - starting a new one...
2014-06-08 23:14:57.817 INFO  ZooKeeper [main-EventThread] - Initiating client 
connection, 
connectString=host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
 sessionTimeout=15000 
watcher=org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
2014-06-08 23:14:57.857 INFO  ConnectionManager [main-EventThread] - Waiting 
for client to connect to ZooKeeper
2014-06-08 23:14:57.859 INFO  ClientCnxn [main-SendThread(host4:1234)] - 
Opening socket connection to server host4/172.17.14.107:1234. Will not attempt 
to authenticate using SASL (unknown error)
2014-06-08 23:14:57.891 INFO  ClientCnxn [main-SendThread(host4:1234)] - Socket 
connection established to host4/172.17.14.107:1234, initiating session
2014-06-08 23:14:57.906 INFO  ClientCnxn [main-SendThread(host4:1234)] - 
Session establishment complete on server host4/172.17.14.107:1234, sessionid = 
0x4467d8d79260486, negotiated timeout = 15000
2014-06-08 23:14:57.907 INFO  ConnectionManager [main-EventThread] - Watcher 
org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 
name:ZooKeeperConnection 
Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX
 got event WatchedEvent state:SyncConnected type:None path:null path:null 
type:None
2014-06-08 23:14:57.909 INFO  ConnectionManager [main-EventThread] - Client is 
connected to ZooKeeper
2014-06-08 23:14:57.909 INFO  ConnectionManager [main-EventThread] - Connection 
with ZooKeeper reestablished.
2014-06-08 23:14:57.911 ERROR ZkController [Thread-203] - 
:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
= Session expired for /overseer_elect/election
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
  at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
  at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
  at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
  at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
  at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
  at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
  at 
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
2014-06-08 23:14:57.911 WARN  ConnectionManager [Thread-203] - Exception 
running onReconnect command
org.apache.solr.common.cloud.ZooKeeperException:
  at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
  at 
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /overseer_elect/election
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
  at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
  at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
  at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
  at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
  at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
  at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
  ... 1 more
2014-06-08 23:14:57.917 INFO  DefaultConnectionStrategy [main-EventThread] - 
Reconnected to ZooKeeper
2014-06-08 23:14:57.917 INFO  ConnectionManager [main-EventThread] - 
Connected:true
2014-06-08 23:14:57.917 INFO  ClientCnxn [main-EventThread] - EventThread shut 
down
2014-06-08 23:14:57.917 INFO  ZkController [Thread-205] - publishing 
core=XXXXXX state=down collection=XXXXXX
2014-06-08 23:14:57.920 INFO  ZkController [Thread-205] - numShards not found 
on descriptor - reading it from system property
2014-06-08 23:14:58.033 INFO  ElectionContext [Thread-205] - canceling election 
null
2014-06-08 23:14:58.083 ERROR ZkController [Thread-205] - 
:java.lang.IllegalArgumentException: Path cannot be null
  at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
  at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
  at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
  at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
  at 
org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
  at 
org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
  at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
  at 
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
2014-06-08 23:14:58.083 WARN  ConnectionManager [Thread-205] - Exception 
running onReconnect command
org.apache.solr.common.cloud.ZooKeeperException:
  at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
  at 
org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
Caused by: java.lang.IllegalArgumentException: Path cannot be null
  at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
  at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
  at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
  at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
  at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
  at 
org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
  at 
org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
  at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
  ... 1 more
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to