Rafał Harabień created SOLR-17275:
-------------------------------------
Summary: Major performance regression of CloudSolrClient in Solr
9.6.0 when using aliases
Key: SOLR-17275
URL: https://issues.apache.org/jira/browse/SOLR-17275
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SolrJ
Affects Versions: 9.6.0
Environment: SolrJ 9.6.0, Ubuntu 22.04, Java 17
Reporter: Rafał Harabień
Attachments: image-2024-05-06-17-23-42-236.png
I observe worse performance of CloudSolrClient after upgrading from SolrJ 9.5.0
to 9.6.0, especially on p99.
p99 jumped from ~25 ms to ~400 ms
p90 jumped from ~9.9 ms to ~22 ms
p75 jumped from ~7 ms to ~11 ms
p50 jumped from ~4.5 ms to ~7.5 ms
Screenshot from Grafana (the new version was deployed at ~14:30):
!image-2024-05-06-17-23-42-236.png!
I took a thread dump and can see many threads blocked in
[ZkStateReader.forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]:
{noformat}
Thread info: "suggest-solrThreadPool-thread-52" prio=5 Id=600 BLOCKED on
org.apache.solr.common.cloud.ZkStateReader@62e6bc3d owned by
"suggest-solrThreadPool-thread-34" Id=582
at
app//org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:506)
- blocked on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d
at
app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getState(ZkClientClusterStateProvider.java:155)
at
app//org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1207)
at
app//org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1099)
at
app//org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:892)
at
app//org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:820)
at
app//org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:255)
at
app//org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:927)
...
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1beb7ed3
{noformat}
At the same time, the QTime reported by Solr hasn't changed, so I'm pretty sure
it's a client regression.
I've tried reproducing it locally and I can see the
[forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]
function being called for every request in my application. I can see that
[this|https://github.com/apache/solr/commit/8cf552aa3642be473c6a08ce44feceb9cbe396d7]
commit
changed the logic in ZkClientClusterStateProvider.getState so the mentioned
function gets called if clusterState.getCollectionRef [returns
null|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java#L151].
This was not the case in 9.5.0 (forceUpdateCollection was not called in this
place). I can see in the debugger that getCollectionRef only supports
collections and not aliases (the collectionStates map contains only collections).
In my application all collections are referenced using aliases, so I guess
that's why I can see the regression in response time.
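
The suspected mechanism can be illustrated with a minimal, self-contained sketch. It is not the Solr code: the class AliasLookupSketch, its fields, and the collection and alias names are hypothetical stand-ins. It assumes only what is described above, namely that the lookup map holds collection names but not aliases, that a miss triggers a forced refresh, and that the refresh is serialized on a monitor (as the BLOCKED threads in the dump suggest).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified stand-in for the lookup path described in this report; NOT the actual Solr code. */
public class AliasLookupSketch {
    // Hypothetical model of the collectionStates map: it holds real collection names only.
    private final Map<String, String> collectionStates = new ConcurrentHashMap<>();
    // Counts how often the expensive refresh path is taken.
    private final AtomicInteger forcedRefreshes = new AtomicInteger();

    public AliasLookupSketch(String... collections) {
        for (String c : collections) {
            collectionStates.put(c, "state-of-" + c);
        }
    }

    // Models a refresh guarded by a monitor: concurrent callers queue up here.
    // The real call would also involve a ZooKeeper round trip; this sketch only counts.
    private synchronized void forceUpdate(String name) {
        forcedRefreshes.incrementAndGet();
    }

    // Models the 9.6.0 behaviour as described above: a miss triggers a forced refresh.
    // An alias is never a key in the map, so every alias lookup takes the slow path.
    public String getState(String name) {
        String state = collectionStates.get(name);
        if (state == null) {
            forceUpdate(name);
            state = collectionStates.get(name); // still null for an alias
        }
        return state;
    }

    public int forcedRefreshes() {
        return forcedRefreshes.get();
    }

    public static void main(String[] args) {
        AliasLookupSketch sketch = new AliasLookupSketch("products_v1");
        for (int i = 0; i < 3; i++) {
            sketch.getState("products_v1"); // real collection: served from the map
            sketch.getState("products");    // hypothetical alias: forces a refresh each time
        }
        System.out.println("forced refreshes: " + sketch.forcedRefreshes()); // prints "forced refreshes: 3"
    }
}
```

In this model, lookups by collection name never reach the synchronized method, while each lookup by alias does, which would match contention growing with request concurrency and showing up mostly in the tail latencies.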
I am not familiar enough with the code to prepare a PR, but I hope this insight
will be enough to fix this issue.