Jason Gustafson created KAFKA-14397:
---------------------------------------

             Summary: Idempotent producer may bump epoch and reset sequence 
numbers prematurely
                 Key: KAFKA-14397
                 URL: https://issues.apache.org/jira/browse/KAFKA-14397
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


Suppose that idempotence is enabled in the producer and we send the following 
single-record batches to a partition leader:
 * A: epoch=0, seq=0
 * B: epoch=0, seq=1
 * C: epoch=0, seq=2

The partition leader receives all 3 of these batches and commits them to the 
log. However, the connection is lost before the `Produce` responses are 
received by the client. Subsequent retries by the producer all fail to be 
delivered.

It is possible in this scenario for the first batch `A` to reach the delivery 
timeout before the subsequence batches. This triggers the following check: 
[https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642.]
 Depending whether retries are exhausted, we may adjust sequence numbers.

The intuition behind this check is that if retries have not been exhausted, 
then we saw a fatal error and the batch could not have been written to the log. 
Hence we should bump the epoch and adjust the sequence numbers of the pending 
batches since they are presumed to be doomed to failure. So in this case, 
batches B and C might get reset with the bumped epoch:
 * B: epoch=1, seq=0
 * C: epoch=1, seq=1

This can result in duplicate records in the log.

The root of the issue is that this logic does not account for expiration of the 
delivery timeout. When the delivery timeout is reached, the number of retries 
is still likely much lower than the max allowed number of retries (which is 
`Int.MaxValue` by default).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to