Stanislav Chizhov created KAFKA-6003:
----------------------------------------
Summary: Replication Fetcher thread for a partition with no data
fails to start
Key: KAFKA-6003
URL: https://issues.apache.org/jira/browse/KAFKA-6003
Project: Kafka
Issue Type: Bug
Components: replication
Affects Versions: 0.11.0.1
Reporter: Stanislav Chizhov
If a partition of a topic written to by an idempotent producer has no data on
one of the brokers, but does have data on the others, and some of the segments
for this partition have already been deleted, then the replication thread
responsible for this partition on the broker with no data fails to start with
an out-of-order sequence exception:
{code}
[2017-10-02 09:44:23,825] ERROR [ReplicaFetcherThread-2-4]: Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: error processing data for partition [stage.data.adevents.v2,20] offset 1660336429
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:203)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:174)
    at scala.Option.foreach(Option.scala:257)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:174)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:171)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:171)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:169)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
Caused by: org.apache.kafka.common.errors.OutOfOrderSequenceException: Invalid sequence number for new epoch: 0 (request epoch), 154277489 (seq. number)
{code}
We run Kafka 0.11.0.1 and ran into a situation where one of the replication
threads had been stopped for a few days while everything else on that broker
was functional. This is our staging cluster and retention is less than a day,
so at the moment we have a broker which cannot start replication for a few
partitions. I was also able to reproduce this in my local test environment.
Another possible case is a disk failure, or any situation where deleting all
the data for the partition on a broker used to help, since the broker would
just fetch all the data from the other replicas. Now this does not work for
topics with idempotent producers. It might also affect other, non-idempotent
topics if they are unlucky enough to share the same replication fetcher thread.
This seems to be caused by this logic:
https://github.com/apache/kafka/blob/0.11.0.1/core/src/main/scala/kafka/log/ProducerStateManager.scala#L119
and might be fixed in the scope of
https://issues.apache.org/jira/browse/KAFKA-5793.
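To illustrate what I think is going on, here is a simplified model of that
check in Python. It is only a sketch of my reading of the linked code, not the
actual Scala implementation: ProducerEntry, validate_append, append and
OutOfOrderSequenceError are hypothetical names, and the empty-state epoch of -1
is an assumption. The point is that a broker with no state for a producer
treats the producer's epoch as new and expects the first sequence number to be
0, which cannot hold once the oldest segments are gone on the leader.

```python
NO_SEQUENCE = -1  # hypothetical marker: no batch seen yet for this producer


class OutOfOrderSequenceError(Exception):
    """Stand-in for org.apache.kafka.common.errors.OutOfOrderSequenceException."""


class ProducerEntry:
    """Simplified per-producer state kept by a broker for one partition."""

    def __init__(self, epoch=-1, last_seq=NO_SEQUENCE):
        # Assumption: an empty entry uses epoch -1, so any real epoch looks new.
        self.epoch = epoch
        self.last_seq = last_seq

    def validate_append(self, epoch, first_seq):
        if epoch != self.epoch:
            # Epoch not seen before on this broker: first batch must start at 0.
            if first_seq != 0:
                raise OutOfOrderSequenceError(
                    f"Invalid sequence number for new epoch: {epoch} (request epoch), "
                    f"{first_seq} (seq. number)")
        elif first_seq != self.last_seq + 1:
            # Same epoch: sequence numbers must be contiguous.
            raise OutOfOrderSequenceError(
                f"Out of order sequence: expected {self.last_seq + 1}, got {first_seq}")

    def append(self, epoch, first_seq, last_seq):
        self.validate_append(epoch, first_seq)
        self.epoch = epoch
        self.last_seq = last_seq


# A replica that saw the log from the beginning accepts contiguous batches.
leader = ProducerEntry()
leader.append(0, 0, 9)
leader.append(0, 10, 19)

# An empty replica fetching from the first offset still retained on the leader
# sees a batch whose first sequence number is far from 0 and rejects it.
follower = ProducerEntry()
try:
    follower.append(0, 154277489, 154277498)
except OutOfOrderSequenceError as err:
    print(err)
```

In this model the empty follower can never catch up, because no batch that
still exists on the leader starts at sequence 0 for that producer epoch.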
In the meantime, any hints on how to get those partitions back to a fully
replicated state would be highly appreciated.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)