[
https://issues.apache.org/jira/browse/KAFKA-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824512#comment-17824512
]
Justine Olshan commented on KAFKA-16352:
----------------------------------------
Thanks for filing [~alivshits]. If I understand correctly, the transaction is
completed on the data partition side (in terms of writing the marker, LSO, etc.),
but the coordinator is not able to complete the final bookkeeping, so we are
unable to continue using the transactional id.
> Transaction may get stuck in PrepareCommit or PrepareAbort state
> ----------------------------------------------------------------
>
> Key: KAFKA-16352
> URL: https://issues.apache.org/jira/browse/KAFKA-16352
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: Artem Livshits
> Assignee: Artem Livshits
> Priority: Major
>
> A transaction took a long time to complete, and trying to restart the
> producer led to CONCURRENT_TRANSACTIONS errors. Investigation showed that
> the transaction had been stuck in PrepareCommit for a few days. The
> transaction state at the time of the investigation (Feb 27 2024) was:
> {{Type |Name                   |Value}}
> {{-----------------------------------------------------------------------------------------}}
> {{ref  |transactionalId        |xxx-yyy}}
> {{long |producerId             |299364}}
> {{ref  |state                  |kafka.coordinator.transaction.PrepareCommit$ @ 0x44fe22760}}
> {{long |txnStartTimestamp      |1708619624810 Thu Feb 22 2024 16:33:44.810 GMT+0000}}
> {{long |txnLastUpdateTimestamp |1708619625335 Thu Feb 22 2024 16:33:45.335 GMT+0000}}
> {{-----------------------------------------------------------------------------------------}}
> The partition list was empty and transactionsWithPendingMarkers didn't
> contain the reference to the transactional state. In the log there were the
> following relevant messages:
> {{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed
> loading transaction metadata from __transaction_state-3 for coordinator epoch
> 611}}
> (this is the partition that contains the transactional id). After the data
> is loaded, the coordinator sends out markers and so on.
> Then there is this message:
> {{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion
> Handler 4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to
> 611; cancel sending transaction markers TxnMarkerEntry\{producerId=299364,
> producerEpoch=1005, coordinatorEpoch=610, result=COMMIT,
> partitions=[foo-bar]} to the brokers}}
> This message is logged just before the state is removed from
> transactionsWithPendingMarkers, but by then transactionsWithPendingMarkers
> apparently contained the entry that was created by the load operation. So
> the sequence of events probably looked like the following:
> # partition load completed
> # commit markers were sent for transactional id xxx-yyy; entry in
> transactionsWithPendingMarkers was created
> # zombie reply from the previous epoch completed, removed entry from
> transactionsWithPendingMarkers
> # commit markers properly completed, but couldn't transition to
> CommitComplete state because transactionsWithPendingMarkers didn't have the
> proper entry, so it got stuck there until the broker was restarted
> Looking at the code, there are a few cases that could lead to similar race
> conditions. The fix is to keep track of the PendingCompleteTxn value that
> was used when sending the marker, so that we remove only the state that
> was created when the marker was sent and do not accidentally remove state
> that someone else created.
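
The race and the proposed fix described above can be illustrated with a small, self-contained Java sketch. This is not the actual Kafka coordinator code: the class PendingMarkerSketch, the methods removeByIdOnly and removeIfSame, and the simplified PendingCompleteTxn fields are all hypothetical and exist only for illustration. The sketch assumes the pending state is kept in a map keyed by transactional id and relies on the standard two-argument ConcurrentHashMap.remove(key, value), which removes the entry only if it is still mapped to the given value.

```java
import java.util.concurrent.ConcurrentHashMap;

public class PendingMarkerSketch {
    // Hypothetical stand-in for the coordinator's pending-completion state.
    // equals() is deliberately not overridden, so two instances are equal
    // only if they are the very same object.
    static final class PendingCompleteTxn {
        final String transactionalId;
        final int coordinatorEpoch;

        PendingCompleteTxn(String transactionalId, int coordinatorEpoch) {
            this.transactionalId = transactionalId;
            this.coordinatorEpoch = coordinatorEpoch;
        }
    }

    // Simplified model of transactionsWithPendingMarkers.
    static final ConcurrentHashMap<String, PendingCompleteTxn> pending =
            new ConcurrentHashMap<>();

    // Racy variant: removes whatever entry is present for the id, even if
    // it was created by someone else (for example, by a later partition load).
    static void removeByIdOnly(String transactionalId) {
        pending.remove(transactionalId);
    }

    // Guarded variant: removes the entry only if it is still the exact
    // object that was registered when this marker was sent.
    static boolean removeIfSame(PendingCompleteTxn sent) {
        return pending.remove(sent.transactionalId, sent);
    }

    public static void main(String[] args) {
        // State that belonged to the old coordinator epoch (zombie request).
        PendingCompleteTxn stale = new PendingCompleteTxn("xxx-yyy", 610);
        // State created after the partition load in the new epoch.
        PendingCompleteTxn current = new PendingCompleteTxn("xxx-yyy", 611);

        // Steps 1-2: load completes, markers are sent, entry is created.
        pending.put(current.transactionalId, current);

        // Step 3: the zombie reply arrives; the guarded removal is a no-op.
        System.out.println(removeIfSame(stale));              // prints "false"
        System.out.println(pending.get("xxx-yyy") == current); // prints "true"

        // Step 4: the real completion finds its own entry and removes it.
        System.out.println(removeIfSame(current));            // prints "true"
    }
}
```

With removeByIdOnly in step 3, the entry for epoch 611 would be gone by the time step 4 runs, which matches the stuck state described in the report.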
--
This message was sent by Atlassian Jira
(v8.20.10#820010)