[ 
https://issues.apache.org/jira/browse/SOLR-5615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864389#comment-13864389
 ] 

Ramkumar Aiyengar commented on SOLR-5615:
-----------------------------------------

Here is a log trace from an actual occurrence, which might help in 
understanding the scenario in the issue description.

{code}
2014-01-06 06:22:03,867 INFO [main-EventThread] o.a.s.c.c.ConnectionManager 
[ConnectionManager.java:88] Our previous ZooKeeper session was expired. 
Attempting to reconnect to recover relationship with ZooKeeper...

// ..

2014-01-06 06:22:12,529 INFO [main-EventThread] o.a.s.c.c.ConnectionManager 
[ConnectionManager.java:103] Connection with ZooKeeper reestablished.

// ..

2014-01-06 06:22:36,573 INFO [main-EventThread] o.a.s.c.ZkController 
[ZkController.java:989] publishing core=collection_20131120_shard205_replica2 
state=down

// ..

2014-01-06 06:28:01,479 INFO [main-EventThread] o.a.s.c.c.ZkStateReader 
[ZkStateReader.java:199] Updating cluster state from ZooKeeper... 
2014-01-06 06:28:01,487 INFO [main-EventThread] o.a.s.c.ZkController 
[ZkController.java:651] Register node as live in 
ZooKeeper:/live_nodes/host5:10750_solr

// See trace above, it directly got cluster state from ZK and successfully 
found the leader, so there is actually a leader at this point contrary to what 
it finds below

2014-01-06 06:28:01,567 INFO [main-EventThread] o.a.s.c.c.SolrZkClient 
[SolrZkClient.java:378] makePath: /live_nodes/host5:10750_solr
2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.ZkController 
[ZkController.java:757] Register replica - 
core:collection_20131120_shard241_replica2 address:http://host5:10750/solr 
collection:collection_20131120 shard:shard241
2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil 
[HttpClientUtil.java:103] Creating new http client, 
config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false

// nothing much after this on main-EventThread for 20 mins..

2014-01-06 06:54:01,786 ERROR [main-EventThread] o.a.s.c.ZkController 
[ZkController.java:869] Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found, 
collection:collection_20131120 slice:shard241

// Then goes on to the next replica ..

2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.ZkController 
[ZkController.java:757] Register replica - 
core:collection_20131120_shard209_replica2 address:http://host5:10750/solr 
collection:collection_20131120 shard:shard209
2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil 
[HttpClientUtil.java:103] Creating new http client, 
config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false

// waits another twenty mins (by which time I ordered a shutdown, so things 
started erroring out sooner after that)

2014-01-06 07:19:21,656 ERROR [main-EventThread] o.a.s.c.ZkController 
[ZkController.java:869] Error getting leader from zk
org.apache.solr.common.SolrException: No registered leader was found, 
collection:collection_20131120 slice:shard209

// After trying to register all other replicas, these failed fast because we 
had ordered a shutdown already..

2014-01-06 07:19:21,693 INFO [main-EventThread] 
o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:48] 
Reconnected to ZooKeeper
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager 
[ConnectionManager.java:130] Connected:true

// And immediately, *now* it fires all the events it was waiting for!

2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager 
[ConnectionManager.java:72] Watcher 
org.apache.solr.common.cloud.ConnectionManager@2467da0a 
name:ZooKeeperConnection Watcher:host1:11600,host2:11600,host3:11600 got event 
WatchedEvent state:Disconnected type:None path:null path:null type:None
2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.z.ClientCnxn 
[ClientCnxn.java:509] EventThread shut down
{code}


> Deadlock while trying to recover after a ZK session expiry
> ----------------------------------------------------------
>
>                 Key: SOLR-5615
>                 URL: https://issues.apache.org/jira/browse/SOLR-5615
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.4, 4.5, 4.6
>            Reporter: Ramkumar Aiyengar
>         Attachments: SOLR-5615.patch
>
>
> The sequence of events which might trigger this is as follows:
>  - Leader of a shard, say OL, has a ZK expiry
>  - The new leader, NL, starts the election process
>  - NL, through Overseer, clears the current leader (OL) for the shard from 
> the cluster state
>  - OL reconnects to ZK, calls onReconnect from event thread (main-EventThread)
>  - OL marks itself down
>  - OL sets up watches for cluster state, and then retrieves it (with no 
> leader for this shard)
>  - NL, through Overseer, updates cluster state to mark itself leader for the 
> shard
>  - OL tries to register itself as a replica and, still on the event thread, 
> waits until the cluster state is updated with the new leader
>  - ZK sends a watch update to OL, but it cannot be delivered because the 
> event thread is blocked waiting for that very update.
> Oops. OL only breaks out once the attempt to register itself as a replica 
> times out after 20 mins.
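
The last two steps of the sequence above can be reproduced without Solr or 
ZooKeeper. What follows is a minimal sketch, not code from Solr and not the 
attached patch: the class name, the reconnect() method, the "worker" executor 
and the latch are all hypothetical stand-ins. It assumes only what the trace 
shows, namely that watch events and the reconnect handling run on one event 
thread. When the wait for the new leader runs on that thread, the notification 
queued behind it cannot be delivered and the wait simply times out. Handing 
the wait to another thread is one common way to avoid this.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class EventThreadDeadlockSketch {

    /**
     * Simulates the replica registration after a reconnect.
     * Returns true if the "new leader" notification arrived before the timeout.
     * All names here are hypothetical; this is not the Solr or ZooKeeper API.
     */
    static boolean reconnect(boolean blockEventThread, long timeoutMs) throws Exception {
        // Stand-in for the single thread that delivers watch events.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        // Hypothetical extra thread used only by the non-blocking variant.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Released by the watch callback once the new leader is visible.
        CountDownLatch leaderSeen = new CountDownLatch(1);

        // "Register replica": wait until the watch callback reports a leader.
        Callable<Boolean> register =
                () -> leaderSeen.await(timeoutMs, TimeUnit.MILLISECONDS);
        try {
            // Either wait on the event thread itself (the reported behaviour)
            // or hand the wait off to another thread.
            ExecutorService runOn = blockEventThread ? eventThread : worker;
            Future<Boolean> registered = runOn.submit(register);

            // The watch update is always queued on the event thread. If that
            // thread is busy waiting, this task cannot run until the timeout.
            eventThread.submit(() -> leaderSeen.countDown());

            return registered.get();
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("waiting on event thread, leader seen: "
                + reconnect(true, 500));
        System.out.println("waiting on worker thread, leader seen: "
                + reconnect(false, 500));
    }
}
```

In this sketch the first call gives up after its timeout without ever seeing 
the leader, which corresponds to the 20 minute stall in the trace, while the 
second call sees the leader almost immediately.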



