urbandan commented on code in PR #12392:
URL: https://github.com/apache/kafka/pull/12392#discussion_r947670046


##########
clients/src/test/java/org/apache/kafka/clients/producer/internals/TransactionManagerTest.java:
##########
@@ -2594,27 +2596,20 @@ public void testDropCommitOnBatchExpiry() throws InterruptedException {
         } catch (ExecutionException e) {
             assertTrue(e.getCause() instanceof  TimeoutException);
         }
+
         runUntil(commitResult::isCompleted);  // the commit shouldn't be completed without being sent since the produce request failed.
         assertFalse(commitResult.isSuccessful());  // the commit shouldn't succeed since the produce request failed.
-        assertThrows(TimeoutException.class, commitResult::await);
+        assertThrows(KafkaException.class, commitResult::await);
 
-        assertTrue(transactionManager.hasAbortableError());
-        assertTrue(transactionManager.hasOngoingTransaction());
+        assertTrue(transactionManager.hasFatalBumpableError());
+        assertFalse(transactionManager.hasOngoingTransaction());
         assertFalse(transactionManager.isCompleting());
-        assertTrue(transactionManager.transactionContainsPartition(tp0));
 
-        TransactionalRequestResult abortResult = transactionManager.beginAbort();
-
-        prepareEndTxnResponse(Errors.NONE, TransactionResult.ABORT, producerId, epoch);
-        prepareInitPidResponse(Errors.NONE, false, producerId, (short) (epoch + 1));
-        runUntil(abortResult::isCompleted);
-        assertTrue(abortResult.isSuccessful());
-        assertFalse(transactionManager.hasOngoingTransaction());
-        assertFalse(transactionManager.transactionContainsPartition(tp0));
+        assertThrows(KafkaException.class, () -> transactionManager.beginAbort());

Review Comment:
   Yes, that is correct. That abort is causing the issue. The producer just assumes that the batches failed, but it is possible that they are still in-flight. When that happens, the abort marker can be processed before the batch. I've seen this in action, and it corrupts the affected partition permanently.
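   To make the ordering problem concrete, here is a minimal standalone sketch (not Kafka code; the log-entry strings and the `deliver` helper are invented for illustration) of how an abort marker sent for a batch the producer merely *assumed* had failed can land in the partition log ahead of the still-in-flight batch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation: a partition log is modeled as an ordered list of
// entries. If the producer aborts while a batch is still in flight, the abort
// marker can be appended to the log before the batch arrives.
public class AbortMarkerRace {
    public static List<String> deliver(boolean batchStillInFlight) {
        List<String> log = new ArrayList<>();
        if (batchStillInFlight) {
            // Producer assumed the expired batch failed and sent the abort;
            // the marker reaches the broker first...
            log.add("ABORT_MARKER(epoch=5)");
            // ...then the delayed batch lands *after* the marker, i.e. outside
            // the transaction it belonged to: the partition is now corrupted.
            log.add("BATCH(epoch=5)");
        } else {
            // Normal case: batch first, then the marker closing the txn.
            log.add("BATCH(epoch=5)");
            log.add("ABORT_MARKER(epoch=5)");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println("race outcome:  " + deliver(true));
        System.out.println("normal outcome: " + deliver(false));
    }
}
```

   In the race outcome the batch sits after the control record that should have covered it, which is the permanent corruption described above.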
   
   If it is better to keep the producer in a usable state, I can give it a shot. In one experiment I tried keeping the producer usable by increasing the epoch on the client side once. I believe this is safe, since the fencing bump will increase the epoch anyway, and the coordinator will never return that epoch to any client.
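   A minimal sketch of that idea (class and method names are hypothetical, not Kafka internals): bump the epoch locally exactly once, so any still-in-flight batches carrying the old epoch would be rejected, mirroring the effect of a fencing bump:

```java
// Hypothetical sketch of a single client-side epoch bump. Assumption: batches
// tagged with the old epoch are fenced off by the broker once the epoch moves
// forward, so the producer can safely start a new transaction.
public class ClientSideEpochBump {
    private short epoch;
    private boolean bumped; // allow at most one local bump

    public ClientSideEpochBump(short epoch) {
        this.epoch = epoch;
    }

    // Returns the epoch to use for the next transaction, bumping at most once.
    public short epochForNextTransaction() {
        if (!bumped) {
            epoch++;        // old-epoch in-flight batches are now stale
            bumped = true;
        }
        return epoch;
    }

    public static void main(String[] args) {
        ClientSideEpochBump mgr = new ClientSideEpochBump((short) 5);
        System.out.println(mgr.epochForNextTransaction()); // bumped once
        System.out.println(mgr.epochForNextTransaction()); // no further bump
    }
}
```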
   
   Please let me know what you think @ijuma @artemlivshits @showuon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
