[PR] KAFKA-XXXXX: AdminClient partition-leader APIs hang until default.api.timeout.ms when a cached leader has left the cluster [kafka]

via GitHub Fri, 05 Jun 2026 11:54:49 -0700


lucasbru opened a new pull request, #22493:
URL: https://github.com/apache/kafka/pull/22493


   Since the partition-leader cache fast path was introduced, `AdminApiDriver` 
routes a cached leader straight to the fulfillment stage with a 
`ConstantNodeIdProvider` for that broker id, skipping the lookup stage. If the 
cached broker id is no longer present in the admin client's cluster metadata — 
for example because the broker left the cluster and was not replaced under the 
same id — the call can never be assigned a node. 
`ConstantNodeIdProvider.provide()` returns null and the call sits in 
`pendingCalls`, issuing only broker-info metadata refreshes. Those refreshes 
never re-resolve the partition leader, so the request stays stuck until 
`default.api.timeout.ms` expires and then fails with "Timed out waiting for a 
node assignment". This affects all `PartitionLeaderStrategy`-based APIs: 
`listOffsets`, `deleteRecords`, `describeProducers` and `abortTransaction`.
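
   To illustrate the stall described above, here is a toy model, not the actual 
Kafka code. `StuckCallSketch`, `provide` and `timesOut` are hypothetical names 
introduced only for this sketch; it assumes a provider pinned to one broker id 
that yields null while that id is missing from the known broker set.

```java
import java.util.Set;

/** Toy model of the stall; the names here are illustrative, not Kafka's classes. */
public class StuckCallSketch {

    /** Stand-in for a provider pinned to one broker id: null means "no node yet". */
    public static Integer provide(int cachedLeaderId, Set<Integer> knownBrokerIds) {
        return knownBrokerIds.contains(cachedLeaderId) ? cachedLeaderId : null;
    }

    /**
     * Returns true if the call is still unassigned after every poll, i.e. it
     * would run into the API timeout. Each poll models a broker-info refresh:
     * the broker set is consulted again, but the pinned leader id is never
     * reconsidered.
     */
    public static boolean timesOut(int cachedLeaderId, Set<Integer> knownBrokerIds, int polls) {
        for (int i = 0; i < polls; i++) {
            if (provide(cachedLeaderId, knownBrokerIds) != null) {
                return false; // a node was assigned, so the call can be sent
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Broker 3 was the cached leader but has left; only 1 and 2 remain.
        System.out.println(timesOut(3, Set.of(1, 2), 100));
        // Broker 2 is still present, so the cached entry is usable.
        System.out.println(timesOut(2, Set.of(1, 2), 100));
    }
}
```

   In this model no number of polls helps once the pinned id is gone, because 
nothing in the loop ever revisits which broker leads the partition.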
   
   The cache's existing staleness recovery only covers leader changes that 
surface as per-partition errors such as `NOT_LEADER_OR_FOLLOWER`. There is no 
path that handles a cached leader id simply disappearing from the cluster 
without any such error reaching the admin client, so the stale entry is 
believed until the deadline.
   
   This change detects, on the admin client thread, that a fulfillment call's 
target broker is absent from ready metadata, and sends the affected keys back 
to the lookup stage so the leader is re-resolved with a fresh topic metadata 
request. The liveness check cannot run where the cache is read: the cache is 
consulted in the `AdminApiDriver` constructor on the calling thread, where 
`AdminMetadataManager` (which is confined to the admin client thread) cannot 
be accessed safely.
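
   The re-routing idea can be sketched roughly as follows. This is a minimal 
illustration under stated assumptions, not the code in this PR: 
`StaleLeaderRerouteSketch`, `Routing` and `route` are hypothetical names, 
partitions are plain strings, and the set of live broker ids stands in for 
whatever the admin client's ready metadata reports.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of the re-routing step; not Kafka's actual classes. */
public class StaleLeaderRerouteSketch {

    /** Result of the liveness check: keys either stay in fulfillment or return to lookup. */
    public static final class Routing {
        public final Map<String, Integer> fulfillment = new TreeMap<>();
        public final Set<String> lookup = new TreeSet<>();
    }

    /**
     * Splits cached partition-to-leader entries by whether the cached leader id
     * is still among the live brokers. Entries with a missing leader are sent
     * back to lookup so the leader can be resolved again.
     */
    public static Routing route(Map<String, Integer> cachedLeaders, Set<Integer> liveBrokerIds) {
        Routing routing = new Routing();
        for (Map.Entry<String, Integer> entry : cachedLeaders.entrySet()) {
            if (liveBrokerIds.contains(entry.getValue())) {
                routing.fulfillment.put(entry.getKey(), entry.getValue());
            } else {
                routing.lookup.add(entry.getKey());
            }
        }
        return routing;
    }

    public static void main(String[] args) {
        // Hypothetical cache: topic-0 is led by broker 1, topic-1 by broker 3.
        Map<String, Integer> cached = Map.of("topic-0", 1, "topic-1", 3);
        // Broker 3 has left the cluster.
        Routing routing = route(cached, Set.of(1, 2));
        System.out.println("fulfillment: " + routing.fulfillment);
        System.out.println("lookup: " + routing.lookup);
    }
}
```

   The sketch takes the live broker set as a parameter, which mirrors the 
constraint above: the decision is made wherever that metadata can be read 
safely, not at the point where the cache is consulted.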
   
   The behaviour is covered by a driver-level test that verifies a cached 
fulfillment request is sent back to the lookup stage, and by an end-to-end test 
in `KafkaAdminClientTest` that seeds the cache, removes the cached leader from 
the cluster, and asserts that the subsequent `listOffsets` re-resolves the 
leader and completes instead of timing out.
   
   The JIRA ticket is still to be filed; the title will be updated with the 
KAFKA number.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
