Lucas Brutschy created KAFKA-20673:
--------------------------------------

             Summary: AdminClient partition-leader APIs hang when a cached 
leader has left the cluster
                 Key: KAFKA-20673
                 URL: https://issues.apache.org/jira/browse/KAFKA-20673
             Project: Kafka
          Issue Type: Task
    Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.1.2, 4.2.0, 4.3.0
            Reporter: Lucas Brutschy
            Assignee: Lucas Brutschy
             Fix For: 4.4.0


{{KafkaAdminClient.listOffsets}}, like any other 
{{PartitionLeaderStrategy}}-routed API ({{deleteRecords}}, 
{{describeProducers}}, {{abortTransaction}}), can block for the full 
{{default.api.timeout.ms}} and then fail with 
{{TimeoutException: Timed out waiting for a node assignment}} when the 
partition-leader cache holds a leader id that is no longer present in the 
cluster metadata.



Since the partition-leader cache fast-path was added (KAFKA-17663, 
[#17367|https://github.com/apache/kafka/pull/17367]), the {{AdminApiDriver}} 
constructor reads {{future.cachedKeyBrokerIdMapping()}} and, for any cached 
entry, routes the key straight into the fulfillment stage under a 
{{FulfillmentScope(brokerId)}}, skipping the lookup stage. The resulting 
{{Call}} is given a {{ConstantNodeIdProvider(brokerId)}}.
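
The routing described above can be sketched with a small self-contained toy model. This is not the {{AdminApiDriver}} code: the class {{CachedRoutingSketch}}, its fields, and the use of plain strings as keys are assumptions made only for illustration. It shows how a key with a cached broker id lands in the fulfillment stage without any check that the broker is still part of the cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (hypothetical names), not Kafka's AdminApiDriver.
final class CachedRoutingSketch {
    // Keys that still need a leader lookup.
    final Set<String> lookupKeys = new TreeSet<>();
    // Keys already assigned to a broker for fulfillment: brokerId -> keys.
    final Map<Integer, Set<String>> fulfillmentKeys = new LinkedHashMap<>();

    CachedRoutingSketch(Set<String> keys, Map<String, Integer> cachedKeyBrokerIdMapping) {
        for (String key : keys) {
            Integer brokerId = cachedKeyBrokerIdMapping.get(key);
            if (brokerId == null) {
                // No cache entry: the key goes through the normal lookup stage.
                lookupKeys.add(key);
            } else {
                // Cache hit: the key goes straight to fulfillment. As in the
                // behavior described in this issue, this model does not verify
                // that brokerId is still a known node.
                fulfillmentKeys.computeIfAbsent(brokerId, id -> new TreeSet<>()).add(key);
            }
        }
    }

    public static void main(String[] args) {
        CachedRoutingSketch driver = new CachedRoutingSketch(
            Set.of("topic-0", "topic-1"), Map.of("topic-0", 3));
        System.out.println("lookup=" + driver.lookupKeys
            + " fulfillment=" + driver.fulfillmentKeys);
    }
}
```

In this sketch, {{topic-0}} is bound to broker 3 purely on the strength of the cache entry, while {{topic-1}} takes the lookup path.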

If that cached {{brokerId}} is no longer in 
{{AdminMetadataManager.cluster().nodes()}}, the call gets stuck:
 * {{ConstantNodeIdProvider.provide()}} does {{nodeById(id)}} → {{null}}, 
calls {{metadataManager.requestUpdate()}}, and returns {{null}}.
 * {{maybeDrainPendingCall}} sees {{null}} and leaves the call in 
{{pendingCalls}}.
 * The only ways out of {{fulfillmentMap}} are {{unmap()}} via {{retryLookup}} 
(driven by an {{onResponse}}/{{onFailure}} for a request that was 
actually _sent_) or a {{DisconnectException}}. Neither fires for a call 
that is never sent.
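
The stall in the list above can be illustrated with a minimal simulation. Everything in it is a hypothetical stand-in rather than Kafka code: {{StaleLeaderStallSketch}}, {{provide}}, and {{run}} are invented names, and a fixed number of loop ticks stands in for the API timeout. It assumes, as described above, that a metadata refresh never changes the cached leader id.

```java
import java.util.Set;

// Toy model of the stall; all names are hypothetical, not Kafka's.
final class StaleLeaderStallSketch {
    // Result of one simulated call.
    static final class Outcome {
        final boolean sent;
        final int metadataRefreshes;

        Outcome(boolean sent, int metadataRefreshes) {
            this.sent = sent;
            this.metadataRefreshes = metadataRefreshes;
        }

        @Override
        public String toString() {
            return "sent=" + sent + ", metadataRefreshes=" + metadataRefreshes;
        }
    }

    // Stand-in for a constant-node provider: the id if it is a known node, else null.
    static Integer provide(int brokerId, Set<Integer> clusterNodes) {
        return clusterNodes.contains(brokerId) ? brokerId : null;
    }

    // Polls for up to maxTicks iterations, which stand in for the API timeout.
    static Outcome run(int cachedBrokerId, Set<Integer> clusterNodes, int maxTicks) {
        int refreshes = 0;
        for (int tick = 0; tick < maxTicks; tick++) {
            Integer node = provide(cachedBrokerId, clusterNodes);
            if (node != null) {
                // The request can be sent, so a response or failure could
                // later trigger a fresh lookup.
                return new Outcome(true, refreshes);
            }
            // Broker-only refresh: the cached leader id for the key is unchanged,
            // so the next tick resolves to null again.
            refreshes++;
        }
        // Deadline reached while the call was still pending.
        return new Outcome(false, refreshes);
    }

    public static void main(String[] args) {
        System.out.println(run(3, Set.of(1, 2), 5)); // stale leader id 3
        System.out.println(run(2, Set.of(1, 2), 5)); // live leader id 2
    }
}
```

With a stale id the loop only accumulates refresh requests and never sends; with a live id the request goes out on the first tick.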

As a result, the call keeps triggering broker-info ({{topics=[]}}) metadata 
refreshes, which never re-resolve the partition leader, until the request 
deadline expires.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
