Kay Hartmann created KAFKA-17831:
------------------------------------

             Summary: Transaction coordinators returning 
COORDINATOR_LOAD_IN_PROGRESS until leader changes or brokers are restarted 
after network instability
                 Key: KAFKA-17831
                 URL: https://issues.apache.org/jira/browse/KAFKA-17831
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.7.1, 3.6.1
            Reporter: Kay Hartmann


After a severe network outage/instability, our brokers ended up in a state 
where some producers could no longer perform transactions: the brokers kept 
responding to those producers with `COORDINATOR_LOAD_IN_PROGRESS`. We saw 
corresponding DEBUG logs on the brokers:
{code:java}
2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning 
COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions 
request (kafka.coordinator.transaction.TransactionCoordinator) 
[data-plane-kafka-request-handler-5] {code}
This did not occur for all transactions, but only for the subset of 
transactional ids whose hash maps to the same `__transaction_state` partition 
and which are therefore handled by the same transaction coordinator (the 
leader of that partition). We resolved this the first time by moving partition 
leaders for the transaction topic to other brokers, and the second time by 
simply restarting brokers.
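
For illustration, the mapping from transactional id to `__transaction_state` 
partition can be sketched as below. This is a minimal model, not Kafka's code: 
the class and method names are hypothetical, the formula (non-negative 
`hashCode` modulo partition count) reflects our understanding of how the 
partition is chosen, and the partition count of 50 assumes the default value of 
`transaction.state.log.num.partitions`.

```java
final class TxnPartitionSketch {
    // Assumption: default transaction.state.log.num.partitions; check your broker config.
    static final int ASSUMED_PARTITION_COUNT = 50;

    // Non-negative hash of the transactional id, modulo the partition count.
    // Integer.MIN_VALUE is mapped to 0 because Math.abs would return it unchanged (negative).
    static int partitionFor(String transactionalId, int partitionCount) {
        int h = transactionalId.hashCode();
        int nonNegative = (h == Integer.MIN_VALUE) ? 0 : Math.abs(h);
        return nonNegative % partitionCount;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" share the same Java hashCode, so they share a partition.
        for (String id : new String[] {"Aa", "BB", "a"}) {
            System.out.println(id + " -> __transaction_state-"
                    + partitionFor(id, ASSUMED_PARTITION_COUNT));
        }
    }
}
```

Under these assumptions the ids `Aa` and `BB` both land on partition 12, so both 
would be affected if the coordinator for that partition is stuck, while `a` 
lands on partition 47 and would be unaffected.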

 

This led us to believe that the transaction coordinators hold some kind of 
stale in-memory state for a `__transaction_state` partition. We found 
two cases 
([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319],
 
[#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376])
 in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. 
In both cases `loadingPartitions` has some state that signals that the 
TransactionStateManager is still occupied with initializing transactional data 
for that `__transaction_state` partition.
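
The behaviour we suspect can be modelled as follows. This is a deliberately 
simplified sketch, not the TransactionStateManager implementation: the class, 
its methods and the plain set of partition ids are hypothetical stand-ins. It 
only shows that an entry left behind in `loadingPartitions` would make every 
request for that partition fail until something removes the entry.

```java
import java.util.HashSet;
import java.util.Set;

final class LoadingPartitionsSketch {
    enum Result { NONE, COORDINATOR_LOAD_IN_PROGRESS }

    // Hypothetical stand-in for the coordinator's in-memory loading state.
    private final Set<Integer> loadingPartitions = new HashSet<>();

    // Called when the broker starts loading a __transaction_state partition.
    void beginLoad(int partition) {
        loadingPartitions.add(partition);
    }

    // Called when loading completes or the partition is unloaded.
    void finishLoad(int partition) {
        loadingPartitions.remove(partition);
    }

    // Requests for a partition that is marked as loading are rejected.
    Result handleRequest(int partition) {
        return loadingPartitions.contains(partition)
                ? Result.COORDINATOR_LOAD_IN_PROGRESS
                : Result.NONE;
    }

    public static void main(String[] args) {
        LoadingPartitionsSketch coordinator = new LoadingPartitionsSketch();
        coordinator.beginLoad(12); // suppose the matching cleanup never runs
        System.out.println(coordinator.handleRequest(12)); // stuck partition
        System.out.println(coordinator.handleRequest(13)); // other partitions are fine
        coordinator.finishLoad(12); // e.g. state reset by a leader move or restart
        System.out.println(coordinator.handleRequest(12));
    }
}
```

In this model only partition 12 is rejected while its entry remains, which 
matches the observation that just a subset of transactional ids was affected.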

We believe that the network outage caused partition leadership to shift 
continuously between replicas and that this somehow led to stale entries 
in `loadingPartitions` that were never cleaned up. I had a look at the 
[method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518]
 where it is updated and cleaned, but wasn't able to identify a case in which 
there could be a failure to clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
