[
https://issues.apache.org/jira/browse/SOLR-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846729#comment-17846729
]
Aparna Suresh edited comment on SOLR-17275 at 5/15/24 6:53 PM:
---------------------------------------------------------------
I have reverted the client-side change that called forceUpdateCollection()
every time an alias is resolved. I think the performance impact came from
contention in forceUpdateCollection(), which synchronizes on the ZkStateReader
object, and not from any ZK getData() call. Apart from the lock, the method
should return pretty quickly, since calls to both
{code:java}
clusterState.getCollectionRef(collection){code}
and
{code:java}
tryLazyCollection.get(){code}
should return null for an alias.
Nevertheless, the revert makes sense, since the call to
{code:java}
CloudSolrClient.resolveAlias(){code}
should not ultimately call
{code:java}
ZkStateReader.forceUpdateCollection(){code}
since the latter is relevant only for collections.
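To illustrate the contention argument, here is a minimal sketch. It is not Solr code: the class `SynchronizedLookupSketch`, its `forceUpdate` method, and the names `products` and `products_v2` are hypothetical stand-ins. It only shows that a synchronized method admits one caller at a time, even when its body is a cheap map lookup that misses, which is the shape of the alias case described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a shared state reader; NOT the real ZkStateReader.
class SynchronizedLookupSketch {
    // Holds collections only; alias names are never keys here (assumption).
    private final Map<String, String> collectionStates = new ConcurrentHashMap<>();
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();
    private final AtomicInteger calls = new AtomicInteger();

    SynchronizedLookupSketch() {
        collectionStates.put("products_v2", "active");
    }

    // The body is cheap, but every caller must first acquire the monitor on `this`.
    synchronized String forceUpdate(String name) {
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max);
        calls.incrementAndGet();
        try {
            return collectionStates.get(name); // null when `name` is an alias
        } finally {
            inside.decrementAndGet();
        }
    }

    int maxConcurrentCallers() { return maxInside.get(); }

    int totalCalls() { return calls.get(); }

    // Hammers forceUpdate with an alias name from several threads.
    static SynchronizedLookupSketch runWorkload(int threads, int callsPerThread) {
        SynchronizedLookupSketch reader = new SynchronizedLookupSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    reader.forceUpdate("products"); // alias: always a miss
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workload did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return reader;
    }
}
```

With `runWorkload(8, 1000)`, all 8000 calls go through the same monitor and `maxConcurrentCallers()` never exceeds 1. Threads that arrive while the lock is held have to wait, which is what a BLOCKED thread in a dump reflects.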
> Major performance regression of CloudSolrClient in Solr 9.6.0 when using
> aliases
> --------------------------------------------------------------------------------
>
> Key: SOLR-17275
> URL: https://issues.apache.org/jira/browse/SOLR-17275
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrJ
> Affects Versions: 9.6.0
> Environment: SolrJ 9.6.0, Ubuntu 22.04, Java 17
> Reporter: Rafał Harabień
> Priority: Blocker
> Fix For: 9.6.1
>
> Attachments: image-2024-05-06-17-23-42-236.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> I observe worse performance of CloudSolrClient after upgrading from SolrJ
> 9.5.0 to 9.6.0, especially on p99.
> p99 jumped from ~25 ms to ~400 ms
> p90 jumped from ~9.9 ms to ~22 ms
> p75 jumped from ~7 ms to ~11 ms
> p50 jumped from ~4.5 ms to ~7.5 ms
> Screenshot from Grafana (the new version was deployed at ~14:30):
> !image-2024-05-06-17-23-42-236.png!
> I took a thread dump and I can see many threads blocked in
> [ZkStateReader.forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]:
> {noformat}
> Thread info: "suggest-solrThreadPool-thread-52" prio=5 Id=600 BLOCKED on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d owned by "suggest-solrThreadPool-thread-34" Id=582
> at app//org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:506)
> - blocked on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d
> at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getState(ZkClientClusterStateProvider.java:155)
> at app//org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1207)
> at app//org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1099)
> at app//org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:892)
> at app//org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:820)
> at app//org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:255)
> at app//org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:927)
> ...
> Number of locked synchronizers = 1
> - java.util.concurrent.ThreadPoolExecutor$Worker@1beb7ed3
> {noformat}
> At the same time, qTime from Solr hasn't changed, so I'm pretty sure it's a
> client regression.
> I've tried reproducing it locally and I can see
> [forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]
> function being called for every request in my application. I can see that
> [this|https://github.com/apache/solr/commit/8cf552aa3642be473c6a08ce44feceb9cbe396d7]
> commit
> changed the logic in ZkClientClusterStateProvider.getState so the mentioned
> function gets called if clusterState.getCollectionRef [returns
> null|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java#L151].
> In 9.5.0 this was not the case (forceUpdateCollection was not called in this
> place). I can see in the debugger that getCollectionRef only supports
> collections and not aliases (the collectionStates map contains only
> collections). In my application all collections are referenced through
> aliases, so I guess that's why I see the regression in Solr response time.
> I am not familiar enough with the code to prepare a PR, but I hope this
> insight will be enough to fix the issue.
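The lookup path described in the report can be modelled with a small sketch. Everything here is hypothetical (the class `AliasLookupSketch`, both `getState...` methods, and the `products` alias); it is neither the code in ZkClientClusterStateProvider nor the fix that was applied. It only contrasts a lookup that forces a refresh on every cache miss with one that skips the refresh for names known to be aliases.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the reported lookup path; NOT actual SolrJ code.
// Single-threaded sketch: the plain HashMaps are not meant for concurrent use.
class AliasLookupSketch {
    private final Map<String, String> collectionStates = new HashMap<>();
    private final Map<String, String> aliases = new HashMap<>();
    private int forcedRefreshes = 0;

    AliasLookupSketch() {
        collectionStates.put("products_v2", "active");
        aliases.put("products", "products_v2"); // alias -> collection
    }

    // Stand-in for an expensive, lock-protected refresh.
    private synchronized String forceRefresh(String name) {
        forcedRefreshes++;
        return collectionStates.get(name);
    }

    // Behaviour as described in the report: any miss triggers a forced refresh,
    // so an alias name pays for the refresh on every single call.
    String getStateRefreshOnMiss(String name) {
        String cached = collectionStates.get(name);
        return cached != null ? cached : forceRefresh(name);
    }

    // Guarded variant (an assumption, for illustration): a name known to be an
    // alias is never a collection, so no refresh is attempted for it.
    String getStateSkippingAliases(String name) {
        String cached = collectionStates.get(name);
        if (cached != null || aliases.containsKey(name)) {
            return cached;
        }
        return forceRefresh(name);
    }

    synchronized int forcedRefreshes() {
        return forcedRefreshes;
    }
}
```

In this model, each call to `getStateRefreshOnMiss("products")` increments the refresh counter, while `getStateSkippingAliases("products")` leaves it unchanged. Lookups by collection name hit the cached map in both variants.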