John Guerrero created SOLR-4900:
-----------------------------------
Summary: Leader election deadlock after restarting leader in 4.2.1
Key: SOLR-4900
URL: https://issues.apache.org/jira/browse/SOLR-4900
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.2
Environment: Linux 64 bit, Tomcat 6.0.35, Java 6u27 64 bit
Reporter: John Guerrero
Copying post from
http://lucene.472066.n3.nabble.com/Leader-election-deadlock-after-restarting-leader-in-4-2-1-td4067988.html
SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), java
6u27 64 bit
6 nodes, 2 shards, 3 replicas each. Names changed to r1s2 (replica1 - shard
2), r2s2, and r3s2 for each replica in shard 2.
*What we see*:
* Under production load, we restart a leader (r1s2), and observe in the cloud
admin
that the old leader is in state "Down" and no new leader is ever elected.
* The system will stay like this until we stop the old leader (or cause a ZK
timeout...see below).
*Please note*: the leader is killed, then kill -9'd 5 seconds later, before
restarting. We have since changed this.
*Digging into the logs on the old leader (r1s2 = replica1-shard 2)*:
* The old leader restarted at 5:23:29 PM, but appears to be stuck in
SolrDispatchFilter.init() -- (See recovery at bottom).
* It doesn't want to become leader, possibly due to the unclean shutdown.
May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=browse url=http://r1s2:8080/solr Our versions are too
old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992
* It then tries to recover, but cannot, because there is no leader.
May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover.
core=browse:org.apache.solr.common.SolrException: No registered leader was
found, collection:browse slice:shard2
* Meanwhile, it appears that blocking in init(), prevents the http-8080 handler
from starting (See recovery at bottom).
*Digging into the other replicas (r2s2)*:
* For some reason, the old leader (r1s2) remains in the list of replicas that
r2s2 attempts to sync to.
May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=browse url=http://r2s2:8080/solr START
replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/]
nUpdates=100
* This apparently fails (30 second timeout), possibly due to http-8080 handler
not being started on r1s2.
May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse
WARNING: PeerSync: core=browse url=http://r2s2:8080/solr exception talking to
http://r1s2:8080/solr/browse/, failed
org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting
response from server at: http://r1s2:8080/solr/browse
*At this point, the cluster will remain indefinitely without a leader, if
nothing else changes.*
But in this particular instance, we took some stack and heap dumps from r1s2,
which paused java
long enough to cause a *zookeeper timeout on the old leader (r1s2)*:
May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run
INFO: Client session timed out, have not heard from server in 38226ms for
sessionid 0x23d28e0f584005d, closing socket connection and attempting reconnect
Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and
succeeded in becoming leader:
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=browse url=http://r3s2:8080/solr START
replicas=[http://r2s2:8080/solr/browse/] nUpdates=100
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=browse url=http://r3s2:8080/solr Received 100 versions
from r2s2:8080/solr/browse/
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=browse url=http://r3s2:8080/solr Our versions are newer.
ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730
May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded
Now that we have a leader, r1s2 can succeed in recovery and finish
SolrDispatchFilter.init(),
apparently allowing the http-8080 handler to start (r1s2).
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay
INFO: No replay needed. core=browse
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Replication Recovery was successful - registering as Active. core=browse
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: publishing core=browse state=active
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: numShards not found on descriptor - reading it from system property
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Finished recovery process. core=browse
May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process. core=browse recoveringAfterStartup=false
May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader
updateClusterState
INFO: Updating cloud state from ZooKeeper...
May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
INFO: user.dir=/
May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init
*INFO: SolrDispatchFilter.init() done*
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: publishing core=browse state=recovering
May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish
INFO: numShards not found on descriptor - reading it from system property
May 28, 2013 5:34:49 PM org.apache.solr.client.solrj.impl.HttpClientUtil
createClient
INFO: Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
May 28, 2013 5:34:49 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
May 28, 2013 5:34:49 PM org.apache.coyote.http11.Http11AprProtocol start
*INFO: Starting Coyote HTTP/1.1 on http-8080 #<----------------------
http-8080 started!*
May 28, 2013 5:34:49 PM org.apache.coyote.ajp.AjpAprProtocol start
INFO: Starting Coyote AJP/1.3 on ajp-8009
May 28, 2013 5:34:49 PM org.apache.catalina.startup.Catalina start
*INFO: Server startup in 668324 ms #<------------------------------ 668 sec =
11 minutes to start Catalina.*
*Our Workaround*:
* We changed our script to allow 15 seconds before kill -9.
* Also, we no longer do a restart. We just stop the leader and wait for a new
leader. This
still results in a few "No registered leader was found" exceptions, but at
least the duration is short.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]