[
https://issues.apache.org/jira/browse/SOLR-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SOLR-17405:
----------------------------------
Labels: pull-request-available (was: )
> Zookeeper session can be re-established by multiple threads concurrently
> ------------------------------------------------------------------------
>
> Key: SOLR-17405
> URL: https://issues.apache.org/jira/browse/SOLR-17405
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 8.11, 9.6
> Reporter: Pierre Salagnac
> Priority: Major
> Labels: pull-request-available
> Attachments: stack.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Because of a bug in SolrCloud, the Zookeeper session can be re-established by
> multiple threads concurrently when an expiration occurs.
> This portion of the code assumes it is single-threaded. Because of the bug, the
> last thread re-establishing the session can wait 30 seconds per core
> for it to be marked {{DOWN}} while it was previously marked
> {{ACTIVE}} by another thread. With a high number of cores, the Solr server
> can hang for hours before taking traffic again.
> The following exception shows that two threads were re-establishing the
> session concurrently. {{ZkController.createEphemeralLiveNode()}} should never
> be invoked twice for the same ZooKeeper session.
> {code:java}
> thrown: java.lang.RuntimeException: org.apache.solr.common.cloud.ZooKeeperException:
>   at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:178)
>   at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:57)
>   at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:152)
>   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.solr.common.cloud.ZooKeeperException:
>   at org.apache.solr.cloud.ZkController$1.command(ZkController.java:462)
>   at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:170)
>   ... 4 more
> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1925)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1830)
>   at org.apache.solr.common.cloud.SolrZkClient.lambda$multi$11(SolrZkClient.java:666)
>   at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
>   at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:666)
>   at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:1086)
>   at org.apache.solr.cloud.ZkController$1.command(ZkController.java:411)
>   ... 5 more
> {code}
> h2. Root cause
> This bug occurs because several threads can re-establish the session
> concurrently.
> It cannot happen at the first expiration of the session, thanks to a thread
> pool with a single thread that executes the ZooKeeper Watcher.
> Below is a code snippet from class {{SolrZkClient.ProcessWatchWithExecutor}}:
> {code:java}
> if (watcher instanceof ConnectionManager) {
>   zkConnManagerCallbackExecutor.submit(() -> watcher.process(event));
> } else {
>   .......
> }
> {code}
> Using this dedicated thread pool (with a single thread) is supposed to ensure
> we don’t handle watches for connection related events with multiple threads.
> This works well for the first session expiration.
> However, when we re-establish the session after the first expiration, we don’t
> use this wrapper to register the watch.
> It is done directly in {{ConnectionManager}} without wrapping the ZK watch.
> In the following snippet, _“this”_ is the ZK watcher instance, but it is not
> wrapped in a {{ProcessWatchWithExecutor}}. This means subsequent events
> will be handled directly by any ZK callback thread.
> {code:java}
> connectionStrategy.reconnect(zkServerAddress,client.getZkClientTimeout(),
> this,
> {code}
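
The serialization idea in the quoted root-cause analysis can be illustrated with a minimal, self-contained sketch. It does not use the Solr or ZooKeeper APIs, and it is not the fix from the pull request: SingleThreadWrapper, ReconnectHandler, and run are hypothetical names made up for this example. It only shows that when several callback threads deliver events through a wrapper backed by a single-thread executor, the handler never runs concurrently with itself.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class SerializedWatcherSketch {

    /** Hypothetical wrapper: every event is handed to one dedicated thread. */
    static final class SingleThreadWrapper<E> implements Consumer<E> {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();
        private final Consumer<E> delegate;

        SingleThreadWrapper(Consumer<E> delegate) {
            this.delegate = delegate;
        }

        @Override
        public void accept(E event) {
            // Callers return immediately; the delegate runs on the single executor thread.
            executor.submit(() -> delegate.accept(event));
        }

        void shutdownAndWait() throws InterruptedException {
            executor.shutdown();
            executor.awaitTermination(30, TimeUnit.SECONDS);
        }
    }

    /** Hypothetical handler that records how many invocations overlap in time. */
    static final class ReconnectHandler implements Consumer<String> {
        private final AtomicInteger inFlight = new AtomicInteger();
        private final AtomicInteger maxInFlight = new AtomicInteger();
        private final AtomicInteger handled = new AtomicInteger();

        @Override
        public void accept(String event) {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(2); // stand-in for the slow session re-establishment work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handled.incrementAndGet();
            inFlight.decrementAndGet();
        }

        int maxInFlight() { return maxInFlight.get(); }
        int handled() { return handled.get(); }
    }

    /** Delivers events from several threads through the wrapper and returns the handler. */
    static ReconnectHandler run(int callbackThreads, int eventsPerThread) throws InterruptedException {
        ReconnectHandler handler = new ReconnectHandler();
        SingleThreadWrapper<String> wrapped = new SingleThreadWrapper<>(handler);
        Thread[] callers = new Thread[callbackThreads];
        for (int i = 0; i < callbackThreads; i++) {
            callers[i] = new Thread(() -> {
                for (int j = 0; j < eventsPerThread; j++) {
                    wrapped.accept("Expired");
                }
            });
            callers[i].start();
        }
        for (Thread t : callers) {
            t.join();
        }
        wrapped.shutdownAndWait();
        return handler;
    }

    public static void main(String[] args) throws InterruptedException {
        ReconnectHandler h = run(4, 8);
        System.out.println("handled=" + h.handled() + " maxInFlight=" + h.maxInFlight());
    }
}
```

If the caller threads invoked the handler directly instead of going through the wrapper, which is the situation the description attributes to the re-registered watch, overlapping invocations would become possible.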