[ https://issues.apache.org/jira/browse/KAFKA-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Artem Livshits resolved KAFKA-16352. ------------------------------------ Fix Version/s: 3.8.0 Reviewer: Justine Olshan Resolution: Fixed > Transaction may get get stuck in PrepareCommit or PrepareAbort state > -------------------------------------------------------------------- > > Key: KAFKA-16352 > URL: https://issues.apache.org/jira/browse/KAFKA-16352 > Project: Kafka > Issue Type: Bug > Components: core > Reporter: Artem Livshits > Assignee: Artem Livshits > Priority: Major > Fix For: 3.8.0 > > > A transaction took a long time to complete, trying to restart a producer > would lead to CONCURRENT_TRANSACTION errors. Investigation has shown that > the transaction was stuck in PrepareCommit for a few days: > (current time when the investigation happened: Feb 27 2024), transaction > state: > {{Type |Name |Value}} > {{-----------------------------------------------------------------------------------------}} > {{ref |transactionalId |xxx-yyy}} > {{long |producerId |299364}} > {{ref |state |kafka.coordinator.transaction.PrepareCommit$ > @ 0x44fe22760}} > {{long |txnStartTimestamp |1708619624810 Thu Feb 22 2024 16:33:44.810 > GMT+0000}} > {{long |txnLastUpdateTimestamp|1708619625335 Thu Feb 22 2024 16:33:45.335 > GMT+0000}} > {{-----------------------------------------------------------------------------------------}} > The partition list was empty and transactionsWithPendingMarkers didn't > contain the reference to the transactional state. In the log there were the > following relevant messages: > {{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed > loading transaction metadata from __transaction_state-3 for coordinator epoch > 611}} > (this is the partition that contains the transactional id). After the data > is loaded, it sends out markers and etc. > Then there is this message: > {{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion > Handler 4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to > 611; cancel sending transaction markers TxnMarkerEntry{producerId=299364, > producerEpoch=1005, coordinatorEpoch=610, result=COMMIT, > partitions=[foo-bar]} to the brokers}} > this message is logged just before the state is removed > transactionsWithPendingMarkers, but the state apparently contained the entry > that was created by the load operation. So the sequence of events probably > looked like the following: > # partition load completed > # commit markers were sent for transactional id xxx-yyy; entry in > transactionsWithPendingMarkers was created > # zombie reply from the previous epoch completed, removed entry from > transactionsWithPendingMarkers > # commit markers properly completed, but couldn't transition to > CommitComplete state because transactionsWithPendingMarkers didn't have the > proper entry, so it got stuck there until the broker was restarted > Looking at the code there are a few cases that could lead to similar race > conditions. The fix is to keep track of the PendingCompleteTxn value that > was used when sending the marker, so that we can only remove the state that > was created when the marker was sent and not accidentally remove the state > someone else created. -- This message was sent by Atlassian Jira (v8.20.10#820010)