[ 
https://issues.apache.org/jira/browse/KAFKA-17831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Hartmann updated KAFKA-17831:
---------------------------------
    Description: 
After experiencing a (heavy) network outage/instability, our brokers ended up in 
a state where some producers could no longer perform transactions; the brokers 
kept responding to those producers with `COORDINATOR_LOAD_IN_PROGRESS`. We saw 
the corresponding DEBUG logs on the brokers:
{code:java}
2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning 
COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions 
request (kafka.coordinator.transaction.TransactionCoordinator) 
[data-plane-kafka-request-handler-5] {code}
This did not occur for all transactions, but only for a subset of transactional 
ids that hash to the same `__transaction_state` partition and therefore go 
through the same transaction coordinator, i.e. the leader of that partition. We 
were able to resolve this the first time by shifting the partition leaders of 
the transaction topic around, and the second time by simply restarting the 
brokers.
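
For reference, a transactional id is mapped to its `__transaction_state` 
partition, and thereby to its coordinator, by hashing the id modulo the 
partition count of the transaction topic. A minimal sketch of that mapping (not 
the broker code itself; the real logic is `TransactionStateManager.partitionFor`, 
and the partition count of 50 below is just the default 
`transaction.state.log.num.partitions`):
{code:scala}
// Minimal sketch of the id -> partition mapping, not the broker code itself.
object TxnPartitionSketch {
  // Assumed default of transaction.state.log.num.partitions.
  val transactionTopicPartitionCount = 50

  // Non-negative hash, mirroring the abs-then-modulo pattern used by the broker.
  private def toPositive(n: Int): Int = n & 0x7fffffff

  def partitionFor(transactionalId: String): Int =
    toPositive(transactionalId.hashCode) % transactionTopicPartitionCount

  def main(args: Array[String]): Unit = {
    // All ids that land on the same partition share one coordinator, which is
    // why only a subset of our producers was affected.
    Seq("my-client", "another-client").foreach { id =>
      println(s"$id -> __transaction_state-${partitionFor(id)}")
    }
  }
}
{code}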

This led us to believe that transaction coordinators must be holding some kind 
of stale in-memory state for a `__transaction_state` partition. We found two 
cases 
([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319],
 
[#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376])
 in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. In 
both cases `loadingPartitions` contains an entry signalling that the 
TransactionStateManager is still busy initializing transactional data for that 
`__transaction_state` partition.
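
Condensed, the check at both linked call sites amounts to the following rough 
sketch (simplified names and types, not the actual method bodies): while 
`loadingPartitions` still holds an entry for the partition a transactional id 
hashes to, every request for that id is answered with 
`COORDINATOR_LOAD_IN_PROGRESS`.
{code:scala}
// Rough sketch only; simplified names, not the actual TransactionStateManager code.
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

final case class TxnPartitionAndLeaderEpoch(txnPartitionId: Int, coordinatorEpoch: Int)

class LoadingCheckSketch(transactionTopicPartitionCount: Int) {
  private val stateLock = new ReentrantReadWriteLock()
  // Holds an entry while a __transaction_state partition is (still) being loaded.
  private val loadingPartitions = mutable.Set.empty[TxnPartitionAndLeaderEpoch]

  private def partitionFor(transactionalId: String): Int =
    (transactionalId.hashCode & 0x7fffffff) % transactionTopicPartitionCount

  def markLoading(partitionId: Int, coordinatorEpoch: Int): Unit = {
    stateLock.writeLock().lock()
    try loadingPartitions += TxnPartitionAndLeaderEpoch(partitionId, coordinatorEpoch)
    finally stateLock.writeLock().unlock()
  }

  /** Left("COORDINATOR_LOAD_IN_PROGRESS") as long as the owning partition is marked as loading. */
  def getTransactionState(transactionalId: String): Either[String, Unit] = {
    val partitionId = partitionFor(transactionalId)
    stateLock.readLock().lock()
    try {
      if (loadingPartitions.exists(_.txnPartitionId == partitionId))
        Left("COORDINATOR_LOAD_IN_PROGRESS") // a stale entry blocks the id indefinitely
      else
        Right(())
    } finally stateLock.readLock().unlock()
  }
}
{code}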

We believe that the network outage caused partition leadership to shift 
continuously between the replicas, and that this somehow left stale entries in 
`loadingPartitions` that were never cleaned up. I had a look at the 
[method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518]
 where the set is updated and cleaned, but was not able to identify a case in 
which the cleanup could fail.
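
To make the suspected lifecycle concrete, here is a hedged sketch of how entries 
are added to and removed from `loadingPartitions` around leadership changes. The 
method names are borrowed from the linked file, but the bodies are heavily 
simplified (the broker runs the load asynchronously on a scheduler, validates 
coordinator epochs and holds a state lock, none of which is modelled here), and 
the epoch-mismatch scenario in the comments is speculation rather than a 
confirmed failure path:
{code:scala}
// Hedged, heavily simplified sketch of the loadingPartitions lifecycle.
import scala.collection.mutable

final case class TxnPartitionAndLeaderEpoch(txnPartitionId: Int, coordinatorEpoch: Int)

class LoadingLifecycleSketch {
  private val loadingPartitions = mutable.Set.empty[TxnPartitionAndLeaderEpoch]

  // Invoked when this broker becomes leader for a __transaction_state partition.
  def loadTransactionsForTxnTopicPartition(partitionId: Int, coordinatorEpoch: Int)
                                          (loadFromLog: () => Unit): Unit = {
    val entry = TxnPartitionAndLeaderEpoch(partitionId, coordinatorEpoch)
    loadingPartitions += entry
    // If the load never reaches its cleanup step, or the entry that is eventually
    // removed is not the one that was added (e.g. a different coordinator epoch
    // after rapid leadership churn), the partition stays "loading" forever.
    // Whether such an interleaving is actually reachable in the broker is exactly
    // what we could not confirm from reading the code.
    try loadFromLog()
    finally loadingPartitions -= entry
  }

  // Invoked when this broker loses leadership for the partition; the broker's
  // version also drops the cached transaction metadata for that partition.
  def removeTransactionsForTxnTopicPartition(partitionId: Int, coordinatorEpoch: Int): Unit =
    loadingPartitions -= TxnPartitionAndLeaderEpoch(partitionId, coordinatorEpoch)

  def isLoading(partitionId: Int): Boolean =
    loadingPartitions.exists(_.txnPartitionId == partitionId)
}
{code}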

  was:
After experiencing a (heavy) network outage/instability, our brokers arrived in 
a state where some producers were not able to perform transactions, but the 
brokers continued to respond to those producers with  
`COORDINATOR_LOAD_IN_PROGRESS`. We were able to see corresponding DEBUG logs in 
the brokers:
{code:java}
2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning 
COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions 
request (kafka.coordinator.transaction.TransactionCoordinator) 
[data-plane-kafka-request-handler-5] {code}
This did not occur for all transactions, but for a subset of transactional ids 
with the same hash that would go through the same transaction 
coordinator/partition leader for the corresponding `__transaction_state` 
partition. We were able to resolve this the first time by shifting partition 
leaders for the transaction topic around and the second time by simply 
restarting brokers.

 

This led us to believe that it has to be some kind of dirty in-memory state 
transaction coordinators have for a `__transaction_state` partition. We found 
two cases 
([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319],
 
[#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376])
 in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. 
In both cases `loadingPartitions` has some state that signals that the 
TransactionStateManager is still occupied with initializing transactional data 
for that `__transaction_state` partition.

We believe that the network outage caused partition leaders to be shifted 
around continuously between their replicas and somehow this led to outdated 
data in `loadingPartitions` that wasn't cleaned up. I had a look at the 
[method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518]
 where it is updated and cleaned, but wasn't able to identify a case in which 
there could be a failure to clean.


> Transaction coordinators returning COORDINATOR_LOAD_IN_PROGRESS until leader 
> changes or brokers are restarted after network instability
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17831
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17831
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.6.1, 3.7.1
>            Reporter: Kay Hartmann
>            Priority: Major
>
> After experiencing a (heavy) network outage/instability, our brokers arrived 
> in a state where some producers were not able to perform transactions, but 
> the brokers continued to respond to those producers with  
> `COORDINATOR_LOAD_IN_PROGRESS`. We were able to see corresponding DEBUG logs 
> in the brokers:
> {code:java}
> 2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning 
> COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's 
> AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) 
> [data-plane-kafka-request-handler-5] {code}
> This did not occur for all transactions, but for a subset of transactional 
> ids with the same hash that would go through the same transaction 
> coordinator/partition leader for the corresponding `__transaction_state` 
> partition. We were able to resolve this the first time by shifting partition 
> leaders for the transaction topic around and the second time by simply 
> restarting brokers.
>  
> This led us to believe that transaction coordinators must be holding some kind 
> of stale in-memory state for a `__transaction_state` partition. We found two 
> cases 
> ([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319],
>  
> [#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376])
>  in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`. 
> In both cases `loadingPartitions` contains an entry signalling that the 
> TransactionStateManager is still busy initializing transactional data for that 
> `__transaction_state` partition.
> We believe that the network outage caused partition leaders to be shifted 
> around continuously between their replicas and somehow this led to outdated 
> data in `loadingPartitions` that wasn't cleaned up. I had a look at the 
> [method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518]
>  where it is updated and cleaned, but wasn't able to identify a case in which 
> there could be a failure to clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
