[ 
https://issues.apache.org/jira/browse/SOLR-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905831#comment-17905831
 ] 

Jan Høydahl commented on SOLR-17405:
------------------------------------

I saw the PR for this issue being auto-closed without any discussion or 
review. Do you need someone to review your PR?

The fix looks solid.
Guess it is hard to write a test for it?

> Zookeeper session can be re-established by multiple threads concurrently
> ------------------------------------------------------------------------
>
>                 Key: SOLR-17405
>                 URL: https://issues.apache.org/jira/browse/SOLR-17405
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 8.11, 9.6
>            Reporter: Pierre Salagnac
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: stack.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Because of a bug in SolrCloud, the ZooKeeper session can be re-established by 
> multiple threads concurrently when an expiration occurs.
> This portion of the code assumes it is single-threaded. Because of the bug, the 
> last thread re-establishing the session can wait for 30 seconds per core, 
> waiting for it to be marked {{DOWN}} while it was previously marked 
> {{ACTIVE}} by another thread. With a high number of cores, the Solr server 
> can hang for hours before taking traffic again.
> The following exception shows that two threads were re-establishing the 
> session concurrently. {{ZkController.createEphemeralLiveNode()}} should never 
> be invoked twice for the same ZooKeeper session.
> {code:java}
> thrown: java.lang.RuntimeException: org.apache.solr.common.cloud.ZooKeeperException:
>          at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:178)
>          at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:57)
>          at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:152)
>          at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>          at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.solr.common.cloud.ZooKeeperException:
>          at org.apache.solr.cloud.ZkController$1.command(ZkController.java:462)
>          at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:170)
>          ... 4 more
> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
>          at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
>          at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1925)
>          at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1830)
>          at org.apache.solr.common.cloud.SolrZkClient.lambda$multi$11(SolrZkClient.java:666)
>          at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
>          at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:666)
>          at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:1086)
>          at org.apache.solr.cloud.ZkController$1.command(ZkController.java:411)
>          ... 5 more
> {code}
> h2. Root cause
> This bug occurs because several threads can re-establish the session 
> concurrently.
> It cannot happen at the first expiration of the session, thanks to a thread 
> pool with a single thread that executes the ZooKeeper Watcher.
> Below is a code snippet from class {{SolrZkClient.ProcessWatchWithExecutor}}:
> {code:java}
>         if (watcher instanceof ConnectionManager) {
>           zkConnManagerCallbackExecutor.submit(() -> watcher.process(event));
>         } else {
>            .......
>         }
> {code}
> Using this dedicated thread pool (with a single thread) is supposed to ensure 
> we don’t handle watches for connection-related events with multiple threads. 
> This works well for the first session expiration.
> Now, when we re-establish the session after the first expiration, we don’t 
> use this wrapper to register the watch.
> It is done directly in {{ConnectionManager}} without wrapping the ZK watch. 
> In the following snippet, _“this”_ is the ZK watcher instance, but it is not 
> wrapped in a {{ProcessWatchWithExecutor}}. This means the next events 
> will be handled directly by any ZK callback thread.
> {code:java}
> connectionStrategy.reconnect(zkServerAddress,client.getZkClientTimeout(), 
> this,
> {code}
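
The difference between a wrapped and an unwrapped watcher, as described in the
root cause above, can be sketched with plain java.util.concurrent. This is an
illustration only, not the patch from the PR: EventWatcher, SingleThreadWatcher
and handlerThreads are hypothetical names, and none of them are Solr or
ZooKeeper classes. The sketch assumes that events may arrive from several
callback threads and records which threads end up running the handler.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for illustration; NOT the Solr or ZooKeeper types.
class SerializedWatcherSketch {

  /** Minimal stand-in for a watcher callback (assumption, not org.apache.zookeeper.Watcher). */
  interface EventWatcher {
    void process(String event);
  }

  /** Wraps a watcher so that every event is handed to one dedicated thread. */
  static final class SingleThreadWatcher implements EventWatcher, AutoCloseable {
    private final EventWatcher delegate;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    SingleThreadWatcher(EventWatcher delegate) {
      this.delegate = delegate;
    }

    @Override
    public void process(String event) {
      // The caller returns immediately; the delegate runs on the executor thread.
      executor.submit(() -> delegate.process(event));
    }

    @Override
    public void close() throws InterruptedException {
      // Stop accepting events and wait for the queued ones to finish.
      executor.shutdown();
      executor.awaitTermination(10, TimeUnit.SECONDS);
    }
  }

  /**
   * Fires events from several caller threads and returns the names of the
   * threads that actually executed the handler.
   */
  static Set<String> handlerThreads(boolean wrap, int callers, int eventsPerCaller)
      throws Exception {
    Set<String> seen = ConcurrentHashMap.newKeySet();
    EventWatcher raw = event -> seen.add(Thread.currentThread().getName());
    SingleThreadWatcher wrapped = wrap ? new SingleThreadWatcher(raw) : null;
    EventWatcher target = wrap ? wrapped : raw;

    Thread[] threads = new Thread[callers];
    for (int i = 0; i < callers; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < eventsPerCaller; j++) {
          target.process("Expired");
        }
      }, "callback-" + i);
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    if (wrapped != null) {
      wrapped.close();
    }
    return seen;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("wrapped handler threads:   " + handlerThreads(true, 4, 10).size());
    System.out.println("unwrapped handler threads: " + handlerThreads(false, 4, 10).size());
  }
}
```

With the wrapper, the handler only ever runs on the executor's single thread,
whatever thread delivered the event. Without it, the handler runs on each
calling thread, which is the situation the issue describes for watches
registered after the first reconnect.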



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
