zwangbo created KAFKA-7848:
------------------------------
Summary: Idempotence producer keep retry on
OutOfOrderSequenceException
Key: KAFKA-7848
URL: https://issues.apache.org/jira/browse/KAFKA-7848
Project: Kafka
Issue Type: Bug
Components: clients, core
Affects Versions: 1.1.0
Environment: CentOS Linux release 7.2.1511 (Core)
Reporter: zwangbo
We increased our cluster capacity from 50 brokers to 80 brokers.
We ran a partition reassignment while producers were sending messages. After
it finished, we found a small number of producers stuck in an infinite retry
loop on OutOfOrderSequenceException. They recovered only when we restarted the
affected producers (which makes each one request a new PID).
We found error logs for the affected partitions in the broker server.log, like:
ERROR [ReplicaManager broker=79] Error processing append operation on partition
xxx1-36 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order
sequence number for producerId 152125: 133262 (incoming seq. number), 133374
(current end sequence number)
ERROR [ReplicaManager broker=79] Error processing append operation on partition
xxx-76 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order
sequence number for producerId 140981: 834530 (incoming seq. number), 834543
(current end sequence number)
The strange thing is that the incoming seq. number is smaller than the broker's
current end sequence number. Before this exception, the affected partition went
through a leader election:
[17:08:20,706] INFO [Partition xxx-76 broker=79] xxx-76 starts at Leader Epoch
2 from offset 217709710. Previous Leader Epoch was: 1 (kafka.cluster.Partition)
[17:08:20,715] INFO [Partition xxx-76 broker=79] xxx-76 starts at Leader Epoch
6 from offset 217709710. Previous Leader Epoch was: 2 (kafka.cluster.Partition)
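To make the numbers in the error log concrete, here is a minimal Python sketch of the kind of sequence check that would reject these batches. It is a simplified model and not the broker's actual code: the names check_sequence and sequences_behind are hypothetical, and the model ignores duplicate detection and sequence wraparound.

```python
class OutOfOrderSequence(Exception):
    """Stand-in for the broker's OutOfOrderSequenceException (toy model)."""


def check_sequence(current_end_seq: int, incoming_first_seq: int) -> None:
    """Accept a batch only if it starts right after the last appended sequence.

    Simplified assumption: no duplicate detection, no int32 wraparound.
    """
    expected = current_end_seq + 1
    if incoming_first_seq != expected:
        raise OutOfOrderSequence(
            f"{incoming_first_seq} (incoming seq. number), "
            f"{current_end_seq} (current end sequence number)"
        )


def sequences_behind(current_end_seq: int, incoming_first_seq: int) -> int:
    """How far the incoming batch starts behind the broker's end sequence."""
    return current_end_seq - incoming_first_seq


if __name__ == "__main__":
    # Values copied from the two log entries above.
    for pid, incoming, end in [(152125, 133262, 133374), (140981, 834530, 834543)]:
        try:
            check_sequence(end, incoming)
        except OutOfOrderSequence as exc:
            print(f"producerId {pid}: rejected, {exc}, "
                  f"{sequences_behind(end, incoming)} behind")
```

In both log entries the incoming batch starts behind the end sequence, so under this model every retry of the same batch is rejected in the same way.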
On the producer side, a NETWORK_EXCEPTION occurred before the producer ran into
OutOfOrderSequenceException. So we think some messages may have been written
successfully to the broker without the response reaching the producer. After
the partition leader changed, the producer's retries of those old messages are
always rejected by the broker with OutOfOrderSequenceException.
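The suspected sequence of events can be sketched as a toy simulation. This only illustrates the hypothesis above and is not a confirmed root cause: ToyLeader and simulate are hypothetical names, and the sketch assumes the new leader knows only the last appended sequence per producer and cannot recognize older batches as duplicates.

```python
class OutOfOrderSequence(Exception):
    """Stand-in for the broker's OutOfOrderSequenceException (toy model)."""


class ToyLeader:
    """Toy partition leader that tracks only the last sequence per producer."""

    def __init__(self, last_seq=None):
        self.last_seq = dict(last_seq or {})

    def append(self, pid: int, first_seq: int, count: int) -> int:
        last = self.last_seq.get(pid, -1)
        if first_seq != last + 1:
            raise OutOfOrderSequence(f"incoming {first_seq}, current end {last}")
        self.last_seq[pid] = first_seq + count - 1
        return self.last_seq[pid]


def simulate(retries: int = 3):
    pid = 1
    old_leader = ToyLeader()
    # Two in-flight batches are appended, but the responses never reach the
    # producer (it only sees a NETWORK_EXCEPTION).
    old_leader.append(pid, 0, 10)
    old_leader.append(pid, 10, 10)
    # Leader election: assume the new leader replicated both batches, so its
    # end sequence for this producer is 19.
    new_leader = ToyLeader(old_leader.last_seq)
    outcomes = []
    for _ in range(retries):
        try:
            # The producer retries the first unacknowledged batch.
            new_leader.append(pid, 0, 10)
            outcomes.append("ok")
        except OutOfOrderSequence:
            outcomes.append("out_of_order")
    # A restarted producer gets a new PID and starts again at sequence 0.
    new_pid_end = new_leader.append(pid + 1, 0, 10)
    return outcomes, new_pid_end


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions every retry fails identically, while a producer with a fresh PID is accepted, which matches the restart workaround described earlier.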
Our primary producer config:
enable.idempotence = true
retries = Integer.MAX_VALUE
acks = all
max.in.flight.requests.per.connection = 5
compression.type = lz4
metadata.max.age.ms = 300000
Topic config:
min.insync.replicas = 2
4 replicas per partition
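As a sanity check on the settings listed above, the following Python sketch tests them against the constraints the producer documentation states for idempotence (acks=all, retries greater than 0, at most 5 in-flight requests per connection). The helper names and the "replication.factor" key are assumptions introduced for illustration; the values are the ones from this report.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

producer_config = {
    "enable.idempotence": True,
    "retries": INT_MAX,
    "acks": "all",
    "max.in.flight.requests.per.connection": 5,
    "compression.type": "lz4",
    "metadata.max.age.ms": 300000,
}

# "replication.factor" is our label for "4 replicas per partition".
topic_config = {"min.insync.replicas": 2, "replication.factor": 4}


def idempotence_problems(cfg: dict) -> list:
    """Return the documented idempotence constraints that cfg violates."""
    problems = []
    if not cfg.get("enable.idempotence"):
        return problems  # the constraints only apply when idempotence is on
    if str(cfg.get("acks")) not in ("all", "-1"):
        problems.append("acks must be 'all'")
    if cfg.get("retries", 0) <= 0:
        problems.append("retries must be greater than 0")
    if cfg.get("max.in.flight.requests.per.connection", 5) > 5:
        problems.append("max.in.flight.requests.per.connection must be <= 5")
    return problems


def tolerated_replica_losses(topic: dict) -> int:
    """Replicas that can drop out of the ISR while acks=all writes still succeed."""
    return topic["replication.factor"] - topic["min.insync.replicas"]


if __name__ == "__main__":
    print(idempotence_problems(producer_config))
    print(tolerated_replica_losses(topic_config))
```

The reported configuration passes these checks, so the retry loop does not appear to come from an obviously invalid idempotence setup.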
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)