[jira] [Commented] (KAFKA-20673) AdminClient partition-leader APIs hang when a cached leader has left the cluster

Lucas Brutschy (Jira) Mon, 08 Jun 2026 09:18:07 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-20673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087030#comment-18087030
 ]


Lucas Brutschy commented on KAFKA-20673:
----------------------------------------

https://issues.apache.org/jira/browse/KAFKA-20628 seems to be related

> AdminClient partition-leader APIs hang when a cached leader has left the 
> cluster
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-20673
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20673
>             Project: Kafka
>          Issue Type: Task
>          Components: clients
>    Affects Versions: 4.0.0, 4.0.1, 4.1.0, 4.2.0, 4.3.0, 4.0.2, 4.1.1, 4.1.2
>            Reporter: Lucas Brutschy
>            Assignee: Lucas Brutschy
>            Priority: Major
>              Labels: regression
>             Fix For: 4.4.0
>
>
> {{KafkaAdminClient.listOffsets}}, like any other 
> {{PartitionLeaderStrategy}}-routed API ({{deleteRecords}}, 
> {{describeProducers}}, {{abortTransaction}}), can block for the full 
> {{default.api.timeout.ms}} and then fail with {{TimeoutException: Timed out 
> waiting for a node assignment}} when the partition-leader cache holds a 
> leader id that is no longer present in the cluster metadata.
> Since the partition-leader cache fast-path was added (KAFKA-17663, 
> [#17367|https://github.com/apache/kafka/pull/17367]), the {{AdminApiDriver}} 
> constructor reads {{future.cachedKeyBrokerIdMapping()}} and, for any cached 
> entry, routes the key straight into the fulfillment stage under a 
> {{FulfillmentScope(brokerId)}}, skipping the lookup stage. The resulting 
> {{Call}} is given a {{ConstantNodeIdProvider(brokerId)}}.
> If that cached {{brokerId}} is no longer in 
> {{AdminMetadataManager.cluster().nodes()}}, the call gets stuck:
>  * {{ConstantNodeIdProvider.provide()}} calls {{nodeById(id)}}, gets {{null}}, 
> calls {{metadataManager.requestUpdate()}}, and returns {{null}}.
>  * {{maybeDrainPendingCall}} sees {{null}} and leaves the call in 
> {{pendingCalls}}.
>  * The only ways out of {{fulfillmentMap}} are {{unmap()}} via 
> {{retryLookup}} (driven by an {{onResponse}}/{{onFailure}} for a 
> request that was actually _sent_) or a {{DisconnectException}}. 
> Neither fires for a call that is never sent.
> So the call just keeps triggering broker-info ({{topics=[]}}) metadata 
> refreshes, which never re-resolve the partition leader, until the request 
> deadline expires.
>  
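
The stuck-call sequence in the quoted description can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kafka's implementation: the class StuckCallSketch, the methods provide and drive, and the maxPolls parameter (a stand-in for the request deadline) are hypothetical names that only mirror the roles the report gives to ConstantNodeIdProvider and maybeDrainPendingCall.

```java
import java.util.Set;

// Toy model of the hang described in the report. None of these types are
// Kafka classes; every name here is hypothetical.
final class StuckCallSketch {

    /** Result of driving one pending call until it is sent or gives up. */
    enum Outcome { SENT, TIMED_OUT }

    /**
     * Plays the role the report assigns to ConstantNodeIdProvider: it can
     * only ever resolve the one broker id it was constructed with.
     * A null return means "no node available". Per the report, the real
     * client also requests a metadata update at this point.
     */
    static Integer provide(int cachedBrokerId, Set<Integer> liveBrokers) {
        return liveBrokers.contains(cachedBrokerId) ? cachedBrokerId : null;
    }

    /**
     * Plays the role of the pending-call drain loop. maxPolls is an assumed
     * stand-in for default.api.timeout.ms.
     */
    static Outcome drive(int cachedBrokerId, Set<Integer> liveBrokers, int maxPolls) {
        for (int poll = 0; poll < maxPolls; poll++) {
            Integer node = provide(cachedBrokerId, liveBrokers);
            if (node != null) {
                return Outcome.SENT; // the call leaves the pending set
            }
            // A broker-only metadata refresh would happen here. It re-reads
            // the live broker set but never re-resolves the partition leader,
            // so cachedBrokerId stays the same and the next poll fails too.
        }
        return Outcome.TIMED_OUT;
    }

    public static void main(String[] args) {
        Set<Integer> live = Set.of(1, 2);
        System.out.println(drive(1, live, 5)); // cached leader still in the cluster
        System.out.println(drive(3, live, 5)); // cached leader has left the cluster
    }
}
```

In this model, a call whose cached leader is still alive is sent on the first poll, while a call pinned to a departed broker exhausts every poll without ever being sent, which corresponds to the timeout described in the report.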



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
