John Guerrero created SOLR-4900:
-----------------------------------

             Summary: Leader election deadlock after restarting leader in 4.2.1
                 Key: SOLR-4900
                 URL: https://issues.apache.org/jira/browse/SOLR-4900
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.2
         Environment: Linux 64 bit, Tomcat 6.0.35, Java 6u27 64 bit
            Reporter: John Guerrero


Copying post from 
http://lucene.472066.n3.nabble.com/Leader-election-deadlock-after-restarting-leader-in-4-2-1-td4067988.html

SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), java 
6u27 64 bit 
6 nodes, 2 shards, 3 replicas each.  Names changed to r1s2 (replica1 - shard 
2), r2s2, and r3s2 for each replica in shard 2. 

*What we see*:
* Under production load, we restart a leader (r1s2), and observe in the cloud 
admin 
that the old leader is in state "Down" and no new leader is ever elected. 
* The system will stay like this until we stop the old leader (or cause a ZK 
timeout...see below). 

*Please note*: the leader is killed, then kill -9'd 5 seconds later, before 
restarting.  We have since changed this. 

*Digging into the logs on the old leader (r1s2 = replica1-shard 2)*:
* The old leader restarted at 5:23:29 PM, but appears to be stuck in 
SolrDispatchFilter.init() -- (See recovery at bottom). 
* It doesn't want to become leader, possibly due to the unclean shutdown. 
May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions 
INFO: PeerSync: core=browse url=http://r1s2:8080/solr  Our versions are too 
old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992 
* It then tries to recover, but cannot, because there is no leader. 
May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log 
SEVERE: Error while trying to recover. 
core=browse:org.apache.solr.common.SolrException: No registered leader was 
found, collection:browse slice:shard2 
* Meanwhile, it appears that blocking in init(), prevents the http-8080 handler 
from starting (See recovery at bottom). 

*Digging into the other replicas (r2s2)*:
* For some reason, the old leader (r1s2) remains in the list of replicas that 
r2s2 attempts to sync to. 
May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync 
INFO: PeerSync: core=browse url=http://r2s2:8080/solr START 
replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/] 
nUpdates=100 
* This apparently fails (30 second timeout), possibly due to http-8080 handler 
not being started on r1s2. 
May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse 
WARNING: PeerSync: core=browse url=http://r2s2:8080/solr  exception talking to 
http://r1s2:8080/solr/browse/, failed 
org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting 
response from server at: http://r1s2:8080/solr/browse

*At this point, the cluster will remain indefinitely without a leader, if 
nothing else changes.*

But in this particular instance, we took some stack and heap dumps from r1s2, 
which paused java 
long enough to cause a *zookeeper timeout on the old leader (r1s2)*: 
May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run 
INFO: Client session timed out, have not heard from server in 38226ms for 
sessionid 0x23d28e0f584005d, closing socket connection and attempting reconnect 

Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and 
succeeded in becoming leader: 
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync 
INFO: PeerSync: core=browse url=http://r3s2:8080/solr START 
replicas=[http://r2s2:8080/solr/browse/] nUpdates=100 
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions 
INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Received 100 versions 
from r2s2:8080/solr/browse/ 
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions 
INFO: PeerSync: core=browse url=http://r3s2:8080/solr  Our versions are newer. 
ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730 
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync 
INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded 

Now that we have a leader, r1s2 can succeed in recovery and finish 
SolrDispatchFilter.init(), 
apparently allowing the http-8080 handler to start (r1s2). 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay 
INFO: No replay needed. core=browse 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery 
INFO: Replication Recovery was successful - registering as Active. core=browse 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
INFO: publishing core=browse state=active 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
INFO: numShards not found on descriptor - reading it from system property 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery 
INFO: Finished recovery process. core=browse 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run 
INFO: Starting recovery process.  core=browse recoveringAfterStartup=false 
May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader 
updateClusterState 
INFO: Updating cloud state from ZooKeeper... 
May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init 
INFO: user.dir=/ 
May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init 
*INFO: SolrDispatchFilter.init() done*
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
INFO: publishing core=browse state=recovering 
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish 
INFO: numShards not found on descriptor - reading it from system property 
May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil 
createClient 
INFO: Creating new http client, 
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false 
May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig deployDirectory 
INFO: Deploying web application directory ROOT 
May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start 
*INFO: Starting Coyote HTTP/1.1 on http-8080  #<---------------------- 
http-8080 started!*
May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start 
INFO: Starting Coyote AJP/1.3 on ajp-8009 
May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start 
*INFO: Server startup in 668324 ms #<------------------------------ 668 sec = 
11 minutes to start Catalina.*

*Our Workaround*:
* We changed our script to allow 15 seconds before kill -9. 
* Also, we no longer do a restart.  We just stop the leader and wait for a new 
leader.  This 
still results in a few "No registered leader was found" exceptions, but at 
least the duration is short. 


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to