[jira] [Updated] (KAFKA-17507) WriteTxnMarkers API must not return until markers are written and materialized in group coordinator's cache

David Jacot (Jira) Mon, 16 Dec 2024 23:38:11 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Jacot updated KAFKA-17507:
--------------------------------
    Fix Version/s: 3.9.1

> WriteTxnMarkers API must not return until markers are written and 
> materialized in group coordinator's cache
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17507
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17507
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: David Jacot
>            Assignee: David Jacot
>            Priority: Major
>             Fix For: 4.0.0, 3.9.1
>
>
> We have observed the error below in some clusters:
> Uncaught exception in scheduled task 'handleTxnCompletion-902667' 
> exception.message:Trying to complete a transactional offset commit for 
> producerId *** and groupId *** even though the offset commit record itself 
> hasn't been appended to the log.
> When a transaction is completed, the transaction coordinator sends a 
> WriteTxnMarkers request to all the partitions involved in the transaction to 
> write the markers to them. When the broker receives it, it writes the markers 
> and if markers are written to the __consumer_offsets partitions, it informs 
> the group coordinator that it can materialize the pending transactional 
> offsets in its main cache. The group coordinator has done this asynchronously 
> since Apache Kafka 2.0; see this 
> [patch|https://github.com/apache/kafka/commit/c53e274d3128bc92f0e8b6a79c407cf764f16f7b].
> The above error happens when the asynchronous operation is executed by the 
> scheduler and the operation finds that there are pending transactional 
> offsets that were not written yet. How come?
> There is actually an issue in the steps described above. The group 
> coordinator does not wait for the asynchronous operation to complete before 
> returning to the API layer. Hence the WriteTxnMarkers response may be sent 
> back to the transaction coordinator before the async operation has actually 
> completed, and the next transactional produce may start before the operation 
> has completed as well. This could explain why the group coordinator has 
> pending transactional offsets that are not written yet.
> There is a similar issue when the transaction is aborted. However, on this 
> path we don't have any checks to verify whether all the pending 
> transactional offsets have been written, so we don't see any errors in 
> our logs. Due to the same race condition, it is possible to actually remove 
> the wrong pending transactional offsets.
> PS: The new group coordinator is not impacted by this bug.
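
The ordering requirement in the issue title can be illustrated with a minimal Java sketch. This is not the actual Kafka code or the actual fix: the class TxnMarkerSketch, the GroupCache type, and the methods onCommitMarkerWritten and handleWriteTxnMarkers are all hypothetical names invented for illustration. The sketch only assumes what the description states, namely that materialization runs asynchronously on a scheduler, and shows one way the response could be tied to the completion of that asynchronous step.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TxnMarkerSketch {

    // Hypothetical stand-in for the group coordinator's offset caches.
    static final class GroupCache {
        final Map<String, Long> pending = new ConcurrentHashMap<>();
        final Map<String, Long> committed = new ConcurrentHashMap<>();
        // Single thread plays the role of the scheduler running the async task.
        final ExecutorService scheduler = Executors.newSingleThreadExecutor();

        // Schedules materialization of the pending transactional offsets and
        // returns a future that completes only once the main cache is updated.
        CompletableFuture<Void> onCommitMarkerWritten() {
            return CompletableFuture.runAsync(() -> {
                committed.putAll(pending);
                pending.clear();
            }, scheduler);
        }
    }

    // Hypothetical API-layer handler: the response is derived from the
    // materialization future, so it cannot be sent before that step is done.
    static CompletableFuture<String> handleWriteTxnMarkers(GroupCache cache) {
        return cache.onCommitMarkerWritten().thenApply(ignored -> "NONE");
    }

    public static void main(String[] args) {
        GroupCache cache = new GroupCache();
        cache.pending.put("topic-0", 42L);

        // join() returns only after the offsets have been materialized.
        String error = handleWriteTxnMarkers(cache).join();
        System.out.println(error + " committed=" + cache.committed
            + " pending=" + cache.pending);

        cache.scheduler.shutdown();
    }
}
```

Returning immediately instead of chaining on the future would reproduce the race described above: the caller could start the next transaction while the previous pending offsets are still waiting to be materialized.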



--
This message was sent by Atlassian Jira
(v8.20.10#820010)