urbandan commented on code in PR #12392:
URL: https://github.com/apache/kafka/pull/12392#discussion_r947670046
##########
clients/src/test/java/org/apache/kafka/clients/producer/internals/TransactionManagerTest.java:
##########
@@ -2594,27 +2596,20 @@ public void testDropCommitOnBatchExpiry() throws InterruptedException {
} catch (ExecutionException e) {
assertTrue(e.getCause() instanceof TimeoutException);
}
+
runUntil(commitResult::isCompleted); // the commit shouldn't be completed without being sent since the produce request failed.
assertFalse(commitResult.isSuccessful()); // the commit shouldn't succeed since the produce request failed.
- assertThrows(TimeoutException.class, commitResult::await);
+ assertThrows(KafkaException.class, commitResult::await);
- assertTrue(transactionManager.hasAbortableError());
- assertTrue(transactionManager.hasOngoingTransaction());
+ assertTrue(transactionManager.hasFatalBumpableError());
+ assertFalse(transactionManager.hasOngoingTransaction());
assertFalse(transactionManager.isCompleting());
- assertTrue(transactionManager.transactionContainsPartition(tp0));
- TransactionalRequestResult abortResult = transactionManager.beginAbort();
-
- prepareEndTxnResponse(Errors.NONE, TransactionResult.ABORT, producerId, epoch);
- prepareInitPidResponse(Errors.NONE, false, producerId, (short) (epoch + 1));
- runUntil(abortResult::isCompleted);
- assertTrue(abortResult.isSuccessful());
- assertFalse(transactionManager.hasOngoingTransaction());
- assertFalse(transactionManager.transactionContainsPartition(tp0));
+ assertThrows(KafkaException.class, () -> transactionManager.beginAbort());
Review Comment:
Yes, that is correct. That abort is causing the issue. The producer just
assumes that the batches failed, but it is possible that they are still
in-flight. When that happens, the abort marker might get processed earlier than
the batch. I've seen this in action, and it corrupts the affected partition
permanently.
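
To make the ordering problem concrete, here is a minimal toy sketch. It is not Kafka code: `ToyPartitionLog`, `batchThenAbort`, and `abortThenLateBatch` are hypothetical names, and the "open transaction" flag is a simplifying assumption that only tracks whether data was appended after the last marker. It does not model how the real broker handles this case.

```java
import java.util.ArrayList;
import java.util.List;

public class AbortRaceSketch {

    /** Toy partition log that only records arrival order. Not Kafka's real log. */
    static final class ToyPartitionLog {
        private final List<String> entries = new ArrayList<>();
        private boolean openTransaction = false;

        // Assumption: any transactional batch opens (or keeps open) a transaction.
        void appendBatch(String name) {
            entries.add(name);
            openTransaction = true;
        }

        // Assumption: an abort marker closes whatever was appended before it.
        void appendAbortMarker() {
            entries.add("ABORT_MARKER");
            openTransaction = false;
        }

        boolean hasOpenTransaction() {
            return openTransaction;
        }

        List<String> entries() {
            return entries;
        }
    }

    /** The order the producer assumes: the batch is done before the abort. */
    static ToyPartitionLog batchThenAbort() {
        ToyPartitionLog log = new ToyPartitionLog();
        log.appendBatch("batch-0");
        log.appendAbortMarker();
        return log;
    }

    /** The racy order: the batch was still in flight when the abort marker landed. */
    static ToyPartitionLog abortThenLateBatch() {
        ToyPartitionLog log = new ToyPartitionLog();
        log.appendAbortMarker();
        log.appendBatch("batch-0");
        return log;
    }

    public static void main(String[] args) {
        ToyPartitionLog ok = batchThenAbort();
        ToyPartitionLog racy = abortThenLateBatch();
        System.out.println(ok.entries() + " open=" + ok.hasOpenTransaction());
        System.out.println(racy.entries() + " open=" + racy.hasOpenTransaction());
    }
}
```

In this toy model the late batch ends up after the marker, so nothing is left to close it.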
If it is better to keep the producer in a usable state, I can give it a
shot. In one experiment, I kept the producer usable by incrementing the epoch
once on the client side. I believe this is safe because the fencing bump will
increase the epoch, and the coordinator will never return that epoch to any
client.
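
A rough toy sketch of that reasoning, under stated assumptions: `ToyCoordinator`, `fenceBump`, `initProducerId`, and `clientSideBump` are hypothetical names, not the real coordinator or `TransactionManager` API. The model assumes the fencing bump increments the epoch without handing it out, and that a later init request increments again before returning.

```java
public class EpochBumpSketch {

    /** Toy coordinator holding one producer epoch. Not Kafka's real coordinator. */
    static final class ToyCoordinator {
        private short epoch;

        ToyCoordinator(short epoch) {
            this.epoch = epoch;
        }

        // Assumption: fencing bumps the epoch but returns it to nobody.
        void fenceBump() {
            epoch++;
        }

        // Assumption: an init request bumps again and returns the new epoch.
        short initProducerId() {
            epoch++;
            return epoch;
        }

        short currentEpoch() {
            return epoch;
        }
    }

    /** The client-side experiment: increase the local epoch exactly once. */
    static short clientSideBump(short epoch) {
        return (short) (epoch + 1);
    }

    public static void main(String[] args) {
        short start = 5;
        ToyCoordinator coordinator = new ToyCoordinator(start);

        short local = clientSideBump(start); // client bumps on its own
        coordinator.fenceBump();             // coordinator fences the old epoch
        short handedOut = coordinator.initProducerId(); // some other client inits

        System.out.println("local=" + local + " handedOut=" + handedOut);
    }
}
```

Under these assumptions, the locally bumped epoch matches the coordinator's epoch after the fencing bump, and no other client is ever handed that value.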
Please let me know what you think @ijuma @artemlivshits @showuon