[ 
https://issues.apache.org/jira/browse/KAFKA-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054490#comment-18054490
 ] 

sanghyeok An edited comment on KAFKA-20090 at 1/27/26 1:43 AM:
---------------------------------------------------------------

[~jolshan] [~chia7712] 

It might help to separate this discussion into two related but slightly 
different concerns. 
I may be misunderstanding some details, so please correct me if anything below 
is off. :)

*A. Single-producer, epoch-exhaustion boundary case (what this issue describes)*
The reproducer seems to show a race between a timeout-driven abort and a client 
EndTxn ABORT near the max-epoch boundary, where the request can be interpreted 
as a retry and we may end up returning a usable MAX_EPOCH state back to the 
client. With TV2, that appears to open a path to an ONGOING transaction that 
cannot be completed or timed out.
For this issue, one option could be to keep the fix fairly targeted around the 
retry/completion handling, and avoid returning an exhausted epoch to the 
client, potentially steering the flow toward re-init or PID rotation using 
existing mechanisms.
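
To make the exhaustion concrete, here is a minimal standalone sketch (illustrative names only, not actual coordinator code): the TV2 producer epoch is a Java short, and completing a transaction requires one more bump, so a MAX_EPOCH state handed back to the client can never be completed.

```java
// Minimal model of TV2 epoch bumping near exhaustion.
// All names are illustrative; this is not TransactionCoordinator code.
public class EpochBoundarySketch {
    static final short MAX_EPOCH = Short.MAX_VALUE; // 32767

    // TV2 bumps the epoch to complete a transaction; a bump past
    // MAX_EPOCH is impossible without rotating to a new producerId.
    static boolean canBump(short epoch) {
        return epoch < MAX_EPOCH;
    }

    public static void main(String[] args) {
        short clientEpoch = (short) (MAX_EPOCH - 1);      // producer's view before the race
        short coordinatorEpoch = (short) (clientEpoch + 1); // fencing abort bumps to MAX_EPOCH

        // If the late client ABORT at clientEpoch is (mis)classified as a
        // retry, the coordinator hands MAX_EPOCH back to the client, and any
        // transaction started there can never be completed or timed out.
        System.out.println(canBump(clientEpoch));      // true: one bump left
        System.out.println(canBump(coordinatorEpoch)); // false: exhausted
    }
}
```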

*B. Ownership-change corner case (raised in Justine's last comment)*
Separately, and independent of max epoch, I think there is a broader question 
when the abort is caused by InitProducerId fencing. If ownership has changed (a 
new producer takes over the transactionalId), then a late ABORT from the old 
producer ideally should not be treated as a valid retry, and we probably want 
to avoid returning the current pid/epoch to the old producer. This makes me 
wonder if the retry classification could incorporate an ownership check in 
addition to just comparing state/result.
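
As a loose sketch of the shape such a check could take (every type and field here is hypothetical; the real classification lives in the coordinator's EndTxn handling and compares state/result):

```java
// Hypothetical retry classification that also considers ownership.
// None of these types mirror actual Kafka classes; this only illustrates
// the extra condition being suggested.
public class RetryClassificationSketch {

    // Tracks whether the last epoch bump came from InitProducerId fencing,
    // i.e. whether ownership of the transactionalId may have changed.
    record TxnState(long producerId, short epoch, boolean bumpedByFencing) {}

    // Roughly today's shape: an EndTxn at (current epoch - 1) against a
    // matching producerId is treated as a retry of the prior completion.
    static boolean looksLikeRetry(TxnState current, long reqPid, short reqEpoch) {
        return reqPid == current.producerId() && reqEpoch == current.epoch() - 1;
    }

    // Suggested refinement: if the bump came from fencing, a late ABORT from
    // the old producer is not a valid retry, and we should not hand the
    // current pid/epoch back to it.
    static boolean isValidRetry(TxnState current, long reqPid, short reqEpoch) {
        return looksLikeRetry(current, reqPid, reqEpoch) && !current.bumpedByFencing();
    }
}
```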

As a related thought (more of a conceptual/long-term idea), I wonder if it 
would be cleaner if each InitProducerId call for a given transactionalId always 
resulted in a new producerId, so that producerId alone represents a unique 
producer instance. Today we effectively identify ownership using producerId 
plus epoch, but epoch can change both when a new producer takes over and when 
the coordinator bumps epochs due to other reasons (e.g., timeouts / 
fencing-related transitions). That can make the conceptual meaning of 
producerId plus epoch feel a bit overloaded.
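
A toy illustration of that idea (purely hypothetical; not how the coordinator allocates ids today): if every InitProducerId call returned a fresh producerId, ownership would reduce to a single id comparison, and epoch would only ever mean "position within this producer instance's sequence".

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative-only allocation where each InitProducerId returns a fresh
// producerId, so the id alone names a unique producer instance.
public class PidRotationSketch {
    private final AtomicLong nextPid = new AtomicLong(0);
    private long currentOwnerPid = -1;

    // Hypothetical stand-in for InitProducerId handling: always rotate.
    long initProducerId() {
        currentOwnerPid = nextPid.getAndIncrement();
        return currentOwnerPid;
    }

    // Ownership check becomes a single comparison; no need to reason about
    // whether an epoch moved due to takeover, timeout, or fencing.
    boolean isCurrentOwner(long pid) {
        return pid == currentOwnerPid;
    }
}
```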

I realize InitProducerId is retryable and always rotating producerId could have 
compatibility/idempotency implications, so this may be better treated as a 
separate discussion rather than part of the minimal fix for this issue.

I’m still learning the TV2/transaction internals, 
so I’d really appreciate any corrections if I got something wrong. :)



> TV2 can allow for ongoing transactions with max epoch that never complete
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-20090
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20090
>             Project: Kafka
>          Issue Type: Task
>            Reporter: Justine Olshan
>            Assignee: sanghyeok An
>            Priority: Major
>
> When transaction version 2 was introduced, epoch bumps happen on every 
> transaction. 
> The original EndTransaction logic considers retries and because of epoch 
> bumps we wanted to be careful to not fence ourselves. This means that for 
> EndTransaction retries, we have to check if the epoch has been bumped to 
> consider a retry.
> The original logic returns the current producer ID and epoch in the 
> transaction metadata when a retry has been identified. The normal end 
> transaction case with max epoch - 1 was considered and accounted for – the 
> state there is safe to return to the producer.
> However, we didn't consider the case of fencing epoch bumps at max 
> epoch - 1, where we also bump the epoch but don't create a new producer ID 
> and epoch. In that scenario the producer is expected to be fenced and call 
> InitProducerId, so this isn't a problem on its own, but it is a problem if 
> we return that state to the producer.
> There is a scenario where we race a timeout and an end transaction abort at 
> max epoch - 1: we can consider the end transaction request a "retry" and 
> return max epoch as the current producer's epoch instead of fencing. 
> 1. The fencing abort on transactional timeout bumps the epoch to max
> 2. The EndTxn request with max epoch - 1 is considered a "retry" and we 
> return max epoch
> 3. The producer can start a transaction since we don't check epochs on 
> starting transactions
> 4. We cannot commit this transaction with TV2 and we cannot timeout the 
> transaction. It is stuck in Ongoing forever. 
> I modified 
> [https://github.com/apache/kafka/blob/aad33e4e41aaa94b06f10a5be0094b717b98900f/core/src/test/scala/unit/kafka/coordinator/transaction/TransactionCoordinatorTest.scala#L1329]
>  to capture this behavior. I added the following code to the end:
> {code:java}
> // Transition to COMPLETE_ABORT since we can't do it via the write-markers
> // response callback
> txnMetadata.completeTransitionTo(new TxnTransitMetadata(
>   txnMetadata.producerId(),
>   txnMetadata.prevProducerId(),
>   txnMetadata.nextProducerId(),
>   Short.MaxValue,
>   (Short.MaxValue - 1).toShort,
>   txnTimeoutMs,
>   txnMetadata.pendingState().get(),
>   new util.HashSet[TopicPartition](),
>   txnMetadata.txnLastUpdateTimestamp(),
>   txnMetadata.txnLastUpdateTimestamp(),
>   TV_2))
> coordinator.handleEndTransaction(transactionalId, producerId,
>   epochAtMaxBoundary, TransactionResult.ABORT, TV_2, endTxnCallback)
> assertEquals(10, newProducerId)
> assertEquals(Short.MaxValue, newEpoch)
> assertEquals(Errors.NONE, error)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
