kirktrue commented on code in PR #17022:
URL: https://github.com/apache/kafka/pull/17022#discussion_r2133073450
##########
clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java:
##########
@@ -779,14 +779,25 @@ public synchronized void maybeTransitionToErrorState(RuntimeException exception)
     }

     synchronized void handleFailedBatch(ProducerBatch batch, RuntimeException exception, boolean adjustSequenceNumbers) {
-        maybeTransitionToErrorState(exception);
+        // Compare the batch with the current ProducerIdAndEpoch. If the producer IDs are the *same* but the epochs
+        // are *different*, consider the batch as stale.
+        boolean isStaleBatch = batch.producerId() == producerIdAndEpoch.producerId && batch.producerEpoch() != producerIdAndEpoch.epoch;

Review Comment:
Here are the places I found in which a `ProducerIdAndEpoch` is compared:

* [`TransactionManager.setProducerIdAndEpoch()`](https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L604-L615) checks the `producerId`, but that check appears to affect only logging.
* [`TransactionManager.maybeUpdateProducerIdAndEpoch()`](https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L589-L602) calls `hasStaleProducerIdAndEpoch()` to compare its current `ProducerIdAndEpoch` with the one in its `txnPartitionMap`. In the case we're seeing, the producer ID in the `ProducerBatch` is out of sync. I don't know if the `txnPartitionMap` is also out of sync in that case.
* [`ProducerBatch.resetProducerState()`](https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/ProducerBatch.java#L476-L481) seems worth a look: could it be called out of sync with the transaction manager?
That method is called by [`TxnPartitionEntry.adjustSequencesDueToFailedBatch()`](https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/TxnPartitionEntry.java#L139-L152), but it resets the batch with the same `ProducerIdAndEpoch` that the batch already carries. Should it be consulting the `TransactionManager` for the _current_ value instead?
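
To make the comparison in the diff concrete, here is a minimal, self-contained sketch of the `isStaleBatch` condition. `StaleBatchSketch` and `IdAndEpoch` are hypothetical stand-ins written for illustration; they are not the real `ProducerBatch` or `ProducerIdAndEpoch` classes, and the ID and epoch values are made up.

```java
public class StaleBatchSketch {

    // Hypothetical stand-in for a (producer ID, epoch) pair; not Kafka's ProducerIdAndEpoch.
    static final class IdAndEpoch {
        final long producerId;
        final short epoch;

        IdAndEpoch(long producerId, short epoch) {
            this.producerId = producerId;
            this.epoch = epoch;
        }
    }

    // Mirrors the condition in the diff: same producer ID, different epoch.
    static boolean isStaleBatch(IdAndEpoch batch, IdAndEpoch current) {
        return batch.producerId == current.producerId && batch.epoch != current.epoch;
    }

    public static void main(String[] args) {
        IdAndEpoch current = new IdAndEpoch(1000L, (short) 5);

        // Same producer ID, older epoch: flagged as stale.
        System.out.println(isStaleBatch(new IdAndEpoch(1000L, (short) 4), current)); // true

        // Same producer ID, same epoch: not stale.
        System.out.println(isStaleBatch(new IdAndEpoch(1000L, (short) 5), current)); // false

        // Different producer ID: not flagged by this condition, whatever the epoch.
        System.out.println(isStaleBatch(new IdAndEpoch(2000L, (short) 0), current)); // false
    }
}
```

The last case is the one I'm unsure about: a batch whose producer ID differs from the current one is not classified as stale by this condition, so it would still take the existing error path.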