Kay Hartmann created KAFKA-17831:
------------------------------------

             Summary: Transaction coordinators returning COORDINATOR_LOAD_IN_PROGRESS until leader changes or brokers are restarted after network instability
                 Key: KAFKA-17831
                 URL: https://issues.apache.org/jira/browse/KAFKA-17831
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.7.1, 3.6.1
            Reporter: Kay Hartmann
After experiencing a (heavy) network outage/instability, our brokers ended up in a state where some producers were unable to perform transactions, and the brokers kept responding to those producers with `COORDINATOR_LOAD_IN_PROGRESS`. We saw the corresponding DEBUG logs on the brokers:

{code:java}
2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) [data-plane-kafka-request-handler-5]
{code}

This did not occur for all transactions, but only for a subset of transactional ids that hash to the same `__transaction_state` partition and therefore go through the same transaction coordinator, i.e. the leader of that partition. We were able to resolve this the first time by shifting partition leaders of the transaction topic around, and the second time by simply restarting brokers. This led us to believe that transaction coordinators must hold some kind of stale in-memory state for a `__transaction_state` partition.

We found two cases ([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319], [#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376]) in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. In both cases `loadingPartitions` contains an entry signalling that the TransactionStateManager is still occupied with initializing transactional data for that `__transaction_state` partition. We believe the network outage caused partition leadership to shift back and forth between replicas continuously, and that this somehow left outdated entries in `loadingPartitions` that were never cleaned up.
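For context on why only transactional ids "with the same hash" were affected: the coordinator for a transactional id is determined by hashing the id onto a `__transaction_state` partition, so all ids that map to the same partition share one coordinator. A minimal sketch of that mapping (assuming the default of 50 transaction-state partitions; this mirrors `TransactionStateManager.partitionFor`, but is a standalone reimplementation, not Kafka's code):

{code:java}
public class TxnPartitionSketch {
    // Hash a transactional id onto a __transaction_state partition.
    // Math.abs(Integer.MIN_VALUE) is still negative, so mask the sign bit
    // instead of taking the absolute value.
    public static int partitionFor(String transactionalId, int transactionTopicPartitionCount) {
        return (transactionalId.hashCode() & 0x7fffffff) % transactionTopicPartitionCount;
    }

    public static void main(String[] args) {
        int partitions = 50; // default transaction.state.log.num.partitions
        for (String id : new String[] {"order-service-1", "order-service-2", "payment-service-1"}) {
            System.out.println(id + " -> partition " + partitionFor(id, partitions));
        }
    }
}
{code}

Two producers whose ids land on the same partition here would both be blocked if that one partition's entry is stuck in `loadingPartitions`, which matches the pattern we observed.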
I had a look at the [method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518] where `loadingPartitions` is updated and cleaned up, but was not able to identify a case in which the cleanup could fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
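To illustrate the failure mode we suspect: the coordinator keeps an in-memory set of partitions that are still loading and rejects requests for them, so if any leader-change path ever skips the matching removal, requests for that partition are rejected until leadership moves or the broker restarts. A deliberately simplified, hypothetical model (the names echo `loadingPartitions` in TransactionStateManager, but this is not Kafka's actual code):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the suspected stale-state bug, not Kafka's code.
public class LoadingGuardSketch {
    private final Set<Integer> loadingPartitions = new HashSet<>();

    // Called when this broker becomes leader for a __transaction_state partition.
    public synchronized void startLoading(int partition) {
        loadingPartitions.add(partition);
    }

    // Called once the transaction log for the partition has been fully read.
    public synchronized void finishLoading(int partition) {
        loadingPartitions.remove(partition);
    }

    // Request path: reject while the partition is (believed to be) loading.
    public synchronized String handleRequest(int partition) {
        if (loadingPartitions.contains(partition))
            return "COORDINATOR_LOAD_IN_PROGRESS";
        return "OK";
    }
}
{code}

If rapid leadership bounces ever trigger `startLoading` without the corresponding `finishLoading` (for example, a load that is aborted without removing its entry), `handleRequest` keeps returning `COORDINATOR_LOAD_IN_PROGRESS` indefinitely, which is exactly the symptom we saw.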