chickenchickenlove commented on code in PR #21469:
URL: https://github.com/apache/kafka/pull/21469#discussion_r2868411196


##########
transaction-coordinator/src/main/java/org/apache/kafka/coordinator/transaction/TransactionMetadata.java:
##########
@@ -139,11 +139,12 @@ public TxnTransitMetadata prepareNoTransit() {
 
     public TxnTransitMetadata prepareFenceProducerEpoch() {
         if (producerEpoch == Short.MAX_VALUE)
-            throw new IllegalStateException("Cannot fence producer with epoch 
equal to Short.MaxValue since this would overflow");
+            LOGGER.error("Fencing producer {} {} with epoch equal to 
Short.MaxValue, this must not happen unless there is a bug", transactionalId, 
producerId);
 
         // If we've already failed to fence an epoch (because the write to the 
log failed), we don't increase it again.
         // This is safe because we never return the epoch to client if we fail 
to fence the epoch
-        short bumpedEpoch = hasFailedEpochFence ? producerEpoch : (short) 
(producerEpoch + 1);
+        // Also don't increase if producerEpoch is already at max, to avoid 
overflow.
+        short bumpedEpoch = hasFailedEpochFence || producerEpoch == 
Short.MAX_VALUE ? producerEpoch : (short) (producerEpoch + 1);

Review Comment:
   @artemlivshits 
   Thank you for the detailed explanation and sorry for delayed response. 🙇‍♂️ 
   
   While going through a few scenarios, I noticed a potential concern:
   1. At the moment, the handling seems to be applied only for `PREPARE_ABORT`. 
This makes sense for cases that originate from timeout-based aborts, etc.
   2. However, a client in the `epoch = 32767` state could still send a commit 
request before the transaction times out. In that case, the epoch would become 
negative number. This would throw an `IllegalArgumentException`, and since 
`endTransaction(...)` does not appear to catch it, the client would likely 
receive `UNKNOWN_SERVER_ERROR`. From the client perspective, this may be 
treated as a fatal error and lead to a `KafkaException`.
   
   Based on that, my understanding of the PR’s intent is:
   1. A Commit from a producer with epoch `32767` is treated as `fatal`, and 
that producer will typically become a zombie producer.
   2. If a transaction with epoch `32767` already exists on the transaction 
coordinator, it should be handled only via the `prepareFenceProducerEpoch(...)` 
path. In that case, the transaction can recover by transitioning to `producerId 
+ 1`, `producerEpoch = 0`, and `state = COMPLETE_ABORT`.
   
   Is my understanding correct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to