[ 
https://issues.apache.org/jira/browse/KAFKA-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054754#comment-18054754
 ] 

sanghyeok An commented on KAFKA-20090:
--------------------------------------

[~jolshan] 

Thanks for correcting me!
I agree with your framing that these are essentially the same underlying issue: 
we’re treating two different operations as equivalent and, as a result, a 
producer that should be fenced can continue to run. In other words, *the core 
issue is making sure we don’t misclassify a late client ABORT as a valid retry 
from a fenced producer.*

One observation (please correct me if I’m off): prior to the KAFKA-19367 
change, in some fencing flows we effectively ended up with a +2 pattern in TV2 
(+1 bump for fencing, and another as part of the abort/commit protocol). In 
such a flow, *a late EndTxn ABORT from the old producer might arrive with an 
epoch that is further behind (e.g., currentEpoch - 2), which could make it less 
likely to be considered a valid retry under an epoch-bump-based retry check.* I 
agree that the previous "+2" behavior effectively avoided this ambiguity. 
However, considering that the project moved away from it to mitigate overflow 
risks near the short max boundary, simply reverting to that model might not be 
a free fix.
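To make the ambiguity concrete, here is a small illustrative sketch (not Kafka's actual coordinator code; the class and method names are made up) of why a retry check based purely on the epoch delta cannot tell the two operations apart once every bump is uniformly +1:

```java
// Illustrative sketch only: names here are hypothetical, not Kafka's coordinator code.
public class EpochRetryCheckSketch {

    // TV2-style check: a completed EndTxn bumped the epoch by 1, so a request
    // carrying currentEpoch - 1 looks like a retry of that EndTxn.
    static boolean looksLikeRetry(short currentEpoch, short requestEpoch) {
        return requestEpoch == currentEpoch - 1;
    }

    public static void main(String[] args) {
        short producerEpoch = 5;

        // Case A: genuine retry. The producer's EndTxn completed and bumped the
        // epoch to 6; the duplicate request still carries epoch 5.
        System.out.println(looksLikeRetry((short) (producerEpoch + 1), producerEpoch)); // true (correct)

        // Case B: a timeout-driven fencing abort also bumped the epoch by exactly 1.
        // The fenced producer's late ABORT shows the same delta, so it is
        // misclassified as a valid retry instead of being rejected.
        System.out.println(looksLikeRetry((short) (producerEpoch + 1), producerEpoch)); // true (wrong)

        // Under the pre-KAFKA-19367 "+2" pattern, the fencing flow ended at
        // producerEpoch + 2, so a late ABORT at currentEpoch - 2 failed the check.
        System.out.println(looksLikeRetry((short) (producerEpoch + 2), producerEpoch)); // false (fenced)
    }
}
```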

If we were to re-introduce any approach that relies on “+2-style” separation 
(even partially), I think we would need to carefully consider compatibility and 
rolling upgrade implications. For example, some clusters may already have 
*transactionalIds* with epochs near the boundary (e.g., {*}32766{*}). If we 
tighten or shift retry semantics (or change bump behavior), existing metadata 
could start behaving differently during a rolling upgrade, potentially 
resulting in unexpected fencing or new failure modes unless we have a clear 
migration story.
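The boundary risk is just short arithmetic; a quick sketch (the 32766 value is from the discussion above, the rest is illustrative):

```java
// Illustrative arithmetic only: why "+2-style" separation interacts badly with
// transactionalIds whose epoch already sits near the short boundary
// (32766 == Short.MaxValue - 1, the value cited above).
public class EpochBoundarySketch {
    public static void main(String[] args) {
        short nearBoundary = 32766;
        int plusOne = nearBoundary + 1;  // 32767 == Short.MaxValue: still representable
        int plusTwo = nearBoundary + 2;  // 32768: no longer fits in a short epoch
        System.out.println(plusOne == Short.MaxValue); // true
        System.out.println(plusTwo > Short.MaxValue);  // true
        System.out.println((short) plusTwo);           // -32768: silent wraparound
    }
}
```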


Echoing what [~chia7712] suggested, I wonder if the cleanest way to avoid these 
ambiguities is to *persist* a small piece of metadata in 
{{__transaction_state}} that captures why the last transition/epoch bump 
happened (e.g., timeout-driven abort vs ownership-change fencing vs 
client-driven EndTxn). Something like a {{lastEpochBumpReason}} (name TBD) 
would let the coordinator distinguish “valid retry” vs “invalid retry” more 
explicitly, rather than inferring intent purely from epoch deltas and state. 
This would also give us a durable signal across restarts/leadership changes and 
could help avoid similar corner cases in the future.
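As a rough sketch of what I mean (the field name is TBD as noted above, and all types here are hypothetical stand-ins, not the actual {{__transaction_state}} schema):

```java
// Hypothetical sketch only: names and types are illustrative, not Kafka's
// actual __transaction_state schema or coordinator code.
public class BumpReasonSketch {

    enum EpochBumpReason { CLIENT_END_TXN, TIMEOUT_ABORT, OWNERSHIP_FENCING }

    // Minimal stand-in for per-transactionalId metadata, extended with a
    // persisted record of why the last epoch bump happened.
    record TxnState(short producerEpoch, EpochBumpReason lastEpochBumpReason) {}

    // With the reason persisted, the coordinator no longer infers intent from
    // the epoch delta alone: only a client-driven EndTxn bump is retryable.
    static boolean isValidRetry(TxnState state, short requestEpoch) {
        return requestEpoch == state.producerEpoch() - 1
            && state.lastEpochBumpReason() == EpochBumpReason.CLIENT_END_TXN;
    }

    public static void main(String[] args) {
        TxnState afterClientCommit = new TxnState((short) 6, EpochBumpReason.CLIENT_END_TXN);
        TxnState afterTimeoutFence = new TxnState((short) 6, EpochBumpReason.TIMEOUT_ABORT);
        System.out.println(isValidRetry(afterClientCommit, (short) 5)); // true: genuine retry
        System.out.println(isValidRetry(afterTimeoutFence, (short) 5)); // false: fenced producer's late ABORT
    }
}
```

Because the reason is written to the log alongside the transition, it would survive coordinator restarts and leadership changes, which is what makes it more robust than in-memory inference.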

 

What do you both think? 

> TV2 can allow for ongoing transactions with max epoch that never complete
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-20090
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20090
>             Project: Kafka
>          Issue Type: Task
>            Reporter: Justine Olshan
>            Assignee: sanghyeok An
>            Priority: Major
>
> When transaction version 2 was introduced, epoch bumps began to happen on 
> every transaction. 
> The original EndTransaction logic considers retries, and because of epoch 
> bumps we wanted to be careful not to fence ourselves. This means that for 
> EndTransaction retries, we have to check whether the epoch has been bumped 
> before treating the request as a retry.
> The original logic returns the current producer ID and epoch in the 
> transaction metadata when a retry has been identified. The normal end 
> transaction case with max epoch - 1 was considered and accounted for – the 
> state there is safe to return to the producer.
> However, we didn't consider the case of fencing epoch bumps with max 
> epoch - 1, where we also bump the epoch but don't create a new producer ID 
> and epoch. In this scenario the producer was expected to be fenced and call 
> init producer ID, so this isn't a problem on its own, but it is a problem if 
> we try to return that state to the producer.
> There is a scenario where a timeout races an end transaction abort with max 
> epoch - 1: we can consider the end transaction request a "retry" and return 
> max epoch as the current producer's epoch instead of fencing. 
> 1. The fencing abort on transactional timeout bumps the epoch to max
> 2. The EndTxn request with max epoch - 1 is considered a "retry" and we 
> return max epoch
> 3. The producer can start a transaction since we don't check epochs on 
> starting transactions
> 4. We cannot commit this transaction with TV2 and we cannot timeout the 
> transaction. It is stuck in Ongoing forever. 
> I modified 
> [https://github.com/apache/kafka/blob/aad33e4e41aaa94b06f10a5be0094b717b98900f/core/src/test/scala/unit/kafka/coordinator/transaction/TransactionCoordinatorTest.scala#L1329]
>  to capture this behavior. I added the following code to the end:
> {code:java}
> // Transition to COMPLETE_ABORT since we can't do it via writing markers response callback
> txnMetadata.completeTransitionTo(new TxnTransitMetadata(
>   txnMetadata.producerId(), txnMetadata.prevProducerId(), txnMetadata.nextProducerId(),
>   Short.MaxValue, Short.MaxValue - 1, txnTimeoutMs, txnMetadata.pendingState().get(),
>   new util.HashSet[TopicPartition](), txnMetadata.txnLastUpdateTimestamp(),
>   txnMetadata.txnLastUpdateTimestamp(), TV_2))
> coordinator.handleEndTransaction(transactionalId, producerId, epochAtMaxBoundary,
>   TransactionResult.ABORT, TV_2, endTxnCallback)
> assertEquals(10, newProducerId)
> assertEquals(Short.MaxValue, newEpoch)
> assertEquals(Errors.NONE, error)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
