Jason Gustafson created KAFKA-14397:
---------------------------------------
Summary: Idempotent producer may bump epoch and reset sequence numbers prematurely
Key: KAFKA-14397
URL: https://issues.apache.org/jira/browse/KAFKA-14397
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
Suppose that idempotence is enabled in the producer and we send the following
single-record batches to a partition leader:
* A: epoch=0, seq=0
* B: epoch=0, seq=1
* C: epoch=0, seq=2
The partition leader receives all 3 of these batches and commits them to the
log. However, the connection is lost before the `Produce` responses are
received by the client. Subsequent retries by the producer all fail to be
delivered.
It is possible in this scenario for the first batch `A` to reach the delivery
timeout before the subsequent batches. This triggers the following check:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642
Depending on whether retries are exhausted, we may adjust sequence numbers.
The intuition behind this check is that if retries have not been exhausted,
then we saw a fatal error and the batch could not have been written to the log.
Hence we should bump the epoch and adjust the sequence numbers of the pending
batches since they are presumed to be doomed to failure. So in this case,
batches B and C might get reset with the bumped epoch:
* B: epoch=1, seq=0
* C: epoch=1, seq=1
This can result in duplicate records in the log.
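The duplication can be illustrated with a minimal sketch. It assumes a greatly simplified model of the leader's per-producer state: a batch is appended only if it starts a newer epoch at seq=0 or is the next sequence in the current epoch. `PartitionLog` and `append` are hypothetical names for illustration, not Kafka classes, and the real broker logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateAfterEpochBump {
    // Hypothetical, simplified stand-in for the leader's per-producer state.
    static final class PartitionLog {
        final List<String> records = new ArrayList<>();
        int lastEpoch = -1;
        int lastSeq = -1;

        // Returns true if the batch was appended, false if it was not
        // (duplicate or out of sequence in this simplified model).
        boolean append(String payload, int epoch, int seq) {
            boolean startsNewEpoch = epoch > lastEpoch && seq == 0;
            boolean nextInSequence = epoch == lastEpoch && seq == lastSeq + 1;
            if (!startsNewEpoch && !nextInSequence) {
                return false;
            }
            records.add(payload);
            lastEpoch = epoch;
            lastSeq = seq;
            return true;
        }
    }

    static List<String> run() {
        PartitionLog log = new PartitionLog();
        // The original sends are all committed, but the responses are lost.
        log.append("A", 0, 0);
        log.append("B", 0, 1);
        log.append("C", 0, 2);
        // A retry with the original epoch and sequence would not be appended again.
        log.append("B", 0, 1);
        // After the premature epoch bump, B and C look like brand-new batches.
        log.append("B", 1, 0);
        log.append("C", 1, 1);
        return log.records;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [A, B, C, B, C]
    }
}
```

In this model the retry of B with its original epoch and sequence is not appended a second time, while the same payloads resent under epoch=1 are accepted because the leader has no way to relate them to the earlier writes.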
The root of the issue is that this logic does not account for expiration of the
delivery timeout. When the delivery timeout is reached, the number of retries
is still likely much lower than the max allowed number of retries (which is
`Integer.MAX_VALUE` by default), so the check wrongly concludes that the batch
failed with a fatal error and was never written.
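As a rough sketch of the reasoning, the heuristic described in this report can be reduced to a comparison of attempts against the retry limit. The method names below are hypothetical and this is not the code in `Sender.java`; the timeout-aware variant is only one possible guard, shown to make the gap concrete, and is not a statement of how the issue is or should be fixed.

```java
public class RetryCheckSketch {
    // Heuristic as described in this report: retries remaining is taken to
    // mean a fatal error occurred, so the batch is presumed not written.
    static boolean presumedNotWritten(int attempts, int maxRetries) {
        return attempts < maxRetries;
    }

    // Hypothetical variant: a batch that merely expired due to the delivery
    // timeout may already be in the log, so it is never presumed unwritten.
    static boolean presumedNotWrittenTimeoutAware(int attempts, int maxRetries,
                                                  boolean deliveryTimeoutExpired) {
        return !deliveryTimeoutExpired && attempts < maxRetries;
    }

    public static void main(String[] args) {
        int maxRetries = Integer.MAX_VALUE; // large default retry limit
        int attempts = 3;                   // assumed: a few retries before the timeout

        // Batch A expired via delivery timeout, yet retries are far from exhausted.
        System.out.println(presumedNotWritten(attempts, maxRetries));                   // true
        System.out.println(presumedNotWrittenTimeoutAware(attempts, maxRetries, true)); // false
    }
}
```

With a retry limit this large, the first check returns true for essentially every batch that expires, which is the condition under which B and C get their epoch and sequence numbers reset.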