patsonluk opened a new pull request, #1460: URL: https://github.com/apache/solr/pull/1460
https://issues.apache.org/jira/browse/SOLR-16701

# Description

This fixes a race condition in the deletion of PRS-enabled collections, which triggers the exception:

```
org.apache.solr.common.SolrException: Error fetching per-replica states
    at __randomizedtesting.SeedInfo.seed([C2BFFBF8FE49C1E1:F1C8D9E308D2745]:0)
    at app//org.apache.solr.common.cloud.PerReplicaStatesFetcher.fetch(PerReplicaStatesFetcher.java:49)
    at app//org.apache.solr.common.cloud.PerReplicaStatesFetcher$LazyPrsSupplier.lambda$new$0(PerReplicaStatesFetcher.java:62)
    at app//org.apache.solr.common.cloud.DocCollection$PrsSupplier.get(DocCollection.java:515)
    at app//org.apache.solr.common.cloud.Replica.isLeader(Replica.java:314)
    at app//org.apache.solr.common.cloud.Slice.findLeader(Slice.java:242)
    at app//org.apache.solr.common.cloud.Slice.setPrsSupplier(Slice.java:56)
    at app//org.apache.solr.common.cloud.DocCollection.<init>(DocCollection.java:123)
    at app//org.apache.solr.common.cloud.ClusterState.collectionFromObjects(ClusterState.java:305)
    at app//org.apache.solr.common.cloud.ClusterState.createFromCollectionMap(ClusterState.java:254)
    at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.createFromJsonSupportingLegacyConfigName(ZkClientClusterStateProvider.java:117)
    at app//org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:1695)
```

This could be triggered by the following sequence:

1. `fetchCollectionState` is called and fetches the state.json
2. Before `fetchCollectionState` fetches the PRS entries, the collection's state.json/PRS are deleted by someone else
3. `fetchCollectionState` throws the exception above when it reaches the PRS fetching logic, as the Zk node state.json is no longer around

# Solution

Create a specific exception `PrsZkNodeNotFoundException` (that extends `SolrException`), to be thrown when the PRS entries cannot be fetched.
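As a rough illustration of the race and of why a dedicated exception type helps, here is a minimal, self-contained sketch. It is not the code from this PR: `FakeZk`, `beforePrsFetch`, `prsNotFoundCount` and the string-based "state" are hypothetical stand-ins for ZooKeeper and `DocCollection`, and the exception extends `RuntimeException` here only so that the sketch compiles without Solr on the classpath. The retry loop is an assumption about what re-fetching the state could look like.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: stand-ins for Solr and ZooKeeper types, not the actual patch.
class PrsRaceSketch {

  /** Stand-in for PrsZkNodeNotFoundException; per the PR, the real one extends SolrException. */
  static class PrsZkNodeNotFoundException extends RuntimeException {
    PrsZkNodeNotFoundException(String path) {
      super("PRS node not found: " + path);
    }
  }

  /** Hypothetical in-memory replacement for ZooKeeper: path -> node data. */
  static class FakeZk {
    final Map<String, String> nodes = new ConcurrentHashMap<>();
    Runnable beforePrsFetch = () -> {}; // hook used to inject a concurrent delete
    int prsNotFoundCount = 0;           // how often the PRS fetch found no node

    /** Returns the state.json content, or null if the node does not exist. */
    String readStateJson(String path) {
      return nodes.get(path);
    }

    /** Fetches the PRS entries; throws the dedicated exception if the node is gone. */
    String readPrs(String path) {
      beforePrsFetch.run();
      if (!nodes.containsKey(path)) {
        prsNotFoundCount++;
        throw new PrsZkNodeNotFoundException(path);
      }
      return "prs-of-" + path;
    }
  }

  /** Returns the collection state, or null once the collection has been deleted. */
  static String fetchCollectionState(FakeZk zk, String path) {
    while (true) {
      String stateJson = zk.readStateJson(path); // step 1: fetch state.json
      if (stateJson == null) {
        return null;                             // collection no longer exists
      }
      try {
        String prs = zk.readPrs(path);           // step 3: fetch the PRS entries
        return stateJson + "+" + prs;
      } catch (PrsZkNodeNotFoundException e) {
        // The node vanished between the two reads: loop and fetch the state again.
      }
    }
  }

  public static void main(String[] args) {
    FakeZk zk = new FakeZk();
    String path = "/collections/c1/state.json";
    zk.nodes.put(path, "{}");
    zk.beforePrsFetch = () -> zk.nodes.remove(path); // step 2: someone else deletes the node
    System.out.println(fetchCollectionState(zk, path)); // prints "null"
    System.out.println(zk.prsNotFoundCount);            // prints "1"
  }
}
```

Because the fetcher signals the missing node with its own exception type, the caller can treat it as "the collection may have been deleted, look again" without swallowing unrelated failures.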
Then in `ZkStateReader#fetchCollectionState`, catch this exception as well (along with the existing `KeeperException.NoNodeException`), and use the same handling to fetch the state again.

# Tests

Added `ZkStateReaderTest#testDeletePrsCollection`, which reproduces this race condition and verifies that:

1. `ZkStateReader#fetchCollectionState` does not throw an exception; instead, it eventually returns `null`, which indicates that the collection is deleted
2. The `PrsZkNodeNotFoundException` was indeed triggered

Please note that the test case is built on the `Breakpoint` introduced by another PR: https://github.com/apache/solr/pull/1457

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Reference Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)
