lucasbru opened a new pull request, #22493: URL: https://github.com/apache/kafka/pull/22493
Since the partition-leader cache fast path was introduced, `AdminApiDriver` routes a cached leader straight to the fulfillment stage with a `ConstantNodeIdProvider` for that broker id, skipping the lookup stage. If the cached broker id is no longer present in the admin client's cluster metadata (for example because the broker left the cluster and was not replaced under the same id), the call can never be assigned a node. `ConstantNodeIdProvider.provide()` returns null and the call sits in `pendingCalls` issuing only broker-info metadata refreshes, which never re-resolve the partition leader, so the request stays stuck until `default.api.timeout.ms` expires and then fails with "Timed out waiting for a node assignment". This affects all `PartitionLeaderStrategy`-based APIs: `listOffsets`, `deleteRecords`, `describeProducers` and `abortTransaction`.

The cache's existing staleness recovery only covers leader changes that surface as per-partition errors such as `NOT_LEADER_OR_FOLLOWER`. There is no path that handles a cached leader id simply disappearing from the cluster without any such error reaching the admin client, so the stale entry is believed until the deadline.

This change detects, on the admin client thread where metadata access is safe, that a fulfillment call's target broker is absent from ready metadata, and sends the affected keys back to the lookup stage so the leader is re-resolved with a fresh topic metadata request. The liveness check is performed on the admin client thread rather than where the cache is read: the cache is consulted in the `AdminApiDriver` constructor on the calling thread, where `AdminMetadataManager` (which is confined to the admin client thread) cannot be accessed safely.

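The routing decision described above can be illustrated with a small, self-contained sketch. This is not the code in the PR: `StaleLeaderRoutingSketch`, `Routing`, and `route` are hypothetical names, partition keys are plain strings, and the set of live broker ids stands in for whatever the admin client's ready cluster metadata exposes. It only shows the idea of checking each cached leader against the known brokers and sending keys with a missing leader back to lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Kafka.
public class StaleLeaderRoutingSketch {

    /** Keys that can go to fulfillment vs. keys that must be re-resolved. */
    public static final class Routing {
        public final List<String> fulfill = new ArrayList<>();
        public final List<String> lookup = new ArrayList<>();
    }

    /**
     * Splits cached (partition key -> leader broker id) entries by whether
     * the cached leader is still present in the set of known broker ids.
     */
    public static Routing route(Map<String, Integer> cachedLeaders,
                                Set<Integer> liveBrokerIds) {
        Routing routing = new Routing();
        for (Map.Entry<String, Integer> entry : cachedLeaders.entrySet()) {
            if (liveBrokerIds.contains(entry.getValue())) {
                // Leader is still known: the cached fast path is usable.
                routing.fulfill.add(entry.getKey());
            } else {
                // Leader id vanished from metadata: fall back to lookup
                // instead of waiting for a node that will never appear.
                routing.lookup.add(entry.getKey());
            }
        }
        return routing;
    }

    public static void main(String[] args) {
        // TreeMap keeps iteration order deterministic for the demo.
        Map<String, Integer> cached = new TreeMap<>();
        cached.put("topic-0", 1);
        cached.put("topic-1", 7); // broker 7 has left the cluster

        Set<Integer> live = new HashSet<>(Arrays.asList(1, 2));
        Routing routing = route(cached, live);

        // prints "fulfill=[topic-0] lookup=[topic-1]"
        System.out.println("fulfill=" + routing.fulfill + " lookup=" + routing.lookup);
    }
}
```

In the actual change the equivalent check has to run on the admin client thread, for the thread-confinement reason given above; the sketch ignores threading entirely.
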
The behaviour is covered by a driver-level test that verifies a cached fulfillment request is sent back to the lookup stage, and by an end-to-end test in `KafkaAdminClientTest` that seeds the cache, removes the cached leader from the cluster, and asserts that the subsequent `listOffsets` re-resolves the leader and completes instead of timing out. The JIRA ticket is still to be filed; the title will be updated with the KAFKA number.
