[
https://issues.apache.org/jira/browse/KAFKA-20673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087030#comment-18087030
]
Lucas Brutschy commented on KAFKA-20673:
----------------------------------------
https://issues.apache.org/jira/browse/KAFKA-20628 seems to be related
> AdminClient partition-leader APIs hang when a cached leader has left the
> cluster
> --------------------------------------------------------------------------------
>
> Key: KAFKA-20673
> URL: https://issues.apache.org/jira/browse/KAFKA-20673
> Project: Kafka
> Issue Type: Task
> Components: clients
> Affects Versions: 4.0.0, 4.0.1, 4.1.0, 4.2.0, 4.3.0, 4.0.2, 4.1.1, 4.1.2
> Reporter: Lucas Brutschy
> Assignee: Lucas Brutschy
> Priority: Major
> Labels: regression
> Fix For: 4.4.0
>
>
> {{KafkaAdminClient.listOffsets}}, like any other
> {{PartitionLeaderStrategy}}-routed API ({{deleteRecords}},
> {{describeProducers}}, {{abortTransaction}}), can block for the full
> {{default.api.timeout.ms}} and then fail with {{TimeoutException: Timed out
> waiting for a node assignment}} when the partition-leader cache holds a
> leader id that is no longer present in the cluster metadata.
> Since the partition-leader cache fast-path was added (KAFKA-17663,
> [#17367|https://github.com/apache/kafka/pull/17367]), the {{AdminApiDriver}}
> constructor reads {{future.cachedKeyBrokerIdMapping()}} and, for any cached
> entry, routes the key straight into the fulfillment stage under a
> {{FulfillmentScope(brokerId)}}, skipping the lookup stage. The resulting
> {{Call}} is given a {{ConstantNodeIdProvider(brokerId)}}.
> If that cached {{brokerId}} is no longer in
> {{AdminMetadataManager.cluster().nodes()}}, the call gets stuck:
> * {{ConstantNodeIdProvider.provide()}} calls {{nodeById(id)}}, gets {{null}},
> calls {{metadataManager.requestUpdate()}}, and returns {{null}}.
> * {{maybeDrainPendingCall}} sees {{null}} and leaves the call in
> {{pendingCalls}}.
> * The only ways out of {{fulfillmentMap}} are {{unmap()}} via
> {{retryLookup}} (driven by an {{onResponse}}/{{onFailure}} for a
> request that was actually _sent_) or a {{DisconnectException}}.
> Neither fires for a call that is never sent.
> So the call just keeps triggering broker-info ({{topics=[]}}) metadata
> refreshes, which never re-resolve the partition leader, until the request
> deadline expires.
>
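The stuck-call sequence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: {{StuckCallSketch}}, {{FixedNodeProvider}}, {{Outcome}} and {{drive}} are hypothetical names invented for illustration, not Kafka classes, and the model assumes every metadata refresh returns the same broker set. It shows that a provider pinned to a broker id missing from the cluster view keeps requesting refreshes and never yields a node, so the call is never sent and only the deadline ends it.

```java
import java.util.Set;

/** Toy model of the hang; none of these types are Kafka's real classes. */
final class StuckCallSketch {

    /** Result of driving one pending call until it is sent or its deadline passes. */
    static final class Outcome {
        final boolean sent;
        final boolean timedOut;
        final int metadataRefreshes;

        Outcome(boolean sent, boolean timedOut, int metadataRefreshes) {
            this.sent = sent;
            this.timedOut = timedOut;
            this.metadataRefreshes = metadataRefreshes;
        }
    }

    /** Hypothetical stand-in for a provider pinned to one cached broker id. */
    static final class FixedNodeProvider {
        private final int brokerId;
        private int refreshRequests = 0;

        FixedNodeProvider(int brokerId) {
            this.brokerId = brokerId;
        }

        /** Returns the broker id if it is live; otherwise requests a refresh and returns null. */
        Integer provide(Set<Integer> liveBrokers) {
            if (liveBrokers.contains(brokerId)) {
                return brokerId;
            }
            refreshRequests++; // models a broker-only metadata update request
            return null;
        }
    }

    /** Polls on a simulated clock until the call gets a node or timeoutMs elapses. */
    static Outcome drive(int cachedLeaderId, Set<Integer> liveBrokers,
                         long timeoutMs, long pollIntervalMs) {
        FixedNodeProvider provider = new FixedNodeProvider(cachedLeaderId);
        long nowMs = 0;
        while (nowMs < timeoutMs) {
            Integer node = provider.provide(liveBrokers);
            if (node != null) {
                return new Outcome(true, false, provider.refreshRequests);
            }
            // Assumption: the refresh brings back the same brokers and nothing
            // re-resolves the partition leader, so the call stays pending.
            nowMs += pollIntervalMs;
        }
        return new Outcome(false, true, provider.refreshRequests);
    }

    public static void main(String[] args) {
        Outcome stale = drive(3, Set.of(1, 2), 60_000L, 100L);
        System.out.println("stale leader: sent=" + stale.sent + " timedOut=" + stale.timedOut);
        Outcome live = drive(2, Set.of(1, 2), 60_000L, 100L);
        System.out.println("live leader: sent=" + live.sent + " timedOut=" + live.timedOut);
    }
}
```

In this model the only way out of the loop for a stale id is the deadline, which is why a fix presumably needs some path that sends a never-sent call back through leader lookup; how the real driver should do that is outside the scope of this sketch.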