Ramkumar Aiyengar created SOLR-5615:
---------------------------------------
Summary: Deadlock while trying to recover after a ZK session expiry
Key: SOLR-5615
URL: https://issues.apache.org/jira/browse/SOLR-5615
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.6, 4.5, 4.4
Reporter: Ramkumar Aiyengar
The sequence of events which might trigger this is as follows:
- The leader of a shard, say OL, has its ZK session expire
- The new leader, NL, starts the election process
- NL, through Overseer, clears the current leader (OL) for the shard from the
cluster state
- OL reconnects to ZK, calls onReconnect from the event thread (main-EventThread)
- OL marks itself down
- OL sets up watches for cluster state, and then retrieves it (with no leader
for this shard)
- NL, through Overseer, updates cluster state to mark itself leader for the
shard
- OL tries to register itself as a replica and, still on the event thread,
waits until the cluster state is updated with the new leader
- ZK sends the watch update to OL, but it can only be processed on the event
thread, which is blocked waiting for that very update.
Oops. OL only breaks out of this when the attempt to register itself as a
replica times out after 20 mins.
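
The blocking pattern described above can be reproduced outside Solr with a
minimal sketch. It is not Solr or ZooKeeper code: a single-threaded executor
stands in for main-EventThread, a CountDownLatch stands in for "cluster state
now has a leader", and the class and method names are hypothetical. Handing
the wait to a second thread is shown only for contrast, not as the fix
adopted for this issue.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names exist in Solr or ZooKeeper.
class EventThreadDeadlockSketch {

    /**
     * Simulates a reconnect callback that needs a later watch notification.
     * Returns true if the notification was seen before the timeout.
     *
     * waitOnEventThread = true  : block on the event thread itself (the reported pattern)
     * waitOnEventThread = false : hand the blocking wait to a separate thread
     */
    static boolean sawLeaderUpdate(boolean waitOnEventThread, long timeoutMs) throws Exception {
        // One thread delivers all callbacks, like ZooKeeper's event thread.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService workerThread = Executors.newSingleThreadExecutor();
        CountDownLatch leaderKnown = new CountDownLatch(1);
        try {
            Future<Future<Boolean>> reconnect = eventThread.submit(() -> {
                // The "new leader" watch notification is queued behind the
                // callback that is currently running on the event thread.
                eventThread.execute(leaderKnown::countDown);

                Callable<Boolean> waitForLeader =
                        () -> leaderKnown.await(timeoutMs, TimeUnit.MILLISECONDS);

                if (waitOnEventThread) {
                    // Blocks the only thread able to run the queued notification,
                    // so this can only end by timing out.
                    return CompletableFuture.completedFuture(waitForLeader.call());
                }
                // Returning immediately frees the event thread to run the notification.
                return workerThread.submit(waitForLeader);
            });
            return reconnect.get().get();
        } finally {
            eventThread.shutdownNow();
            workerThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("wait on event thread:  " + sawLeaderUpdate(true, 500));
        System.out.println("wait on worker thread: " + sawLeaderUpdate(false, 5000));
    }
}
```

In the first call the wait can only end by timeout, which corresponds to the
20 minute stall in this report. In the second call the event thread is free
to deliver the queued notification, so the wait completes almost immediately.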