[jira] [Updated] (KAFKA-6003) Replication Fetcher thread for a partition with no data fails to start

Stanislav Chizhov (JIRA) Mon, 02 Oct 2017 12:56:26 -0700

     [ https://issues.apache.org/jira/browse/KAFKA-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Stanislav Chizhov updated KAFKA-6003:
-------------------------------------
    Description: 
If a partition of a topic with an idempotent producer has no data on one of
the brokers, but data does exist on the others and some of the segments for
this partition have already been deleted, the replication thread responsible
for this partition on the broker which has no data for it fails to start with
an out-of-order sequence exception:
{code}
[2017-10-02 09:44:23,825] ERROR [ReplicaFetcherThread-2-4]: Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: error processing data for partition [stage.data.adevents.v2,20] offset 1660336429
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:203)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:174)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:174)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:171)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:171)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:169)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
Caused by: org.apache.kafka.common.errors.OutOfOrderSequenceException: Invalid sequence number for new epoch: 0 (request epoch), 154277489 (seq. number)
{code}
We run Kafka 0.11.0.1 and ran into a situation where one of the replication
threads was stopped for a few days while everything else on that broker was
functional. This is our staging cluster and retention is less than a day, so
everything for the partitions whose replication thread was down was cleaned up.
At the moment we have a broker which cannot start replication for a few
partitions. I was also able to reproduce this in my local test environment.
Another possible case where this might cause real pain is disk failure, or any
situation where deleting all the data for the partition on a broker used to
help, since the broker would just fetch all the data from other replicas. Now
that does not work for topics with idempotent producers. It might also affect
other, non-idempotent topics if they are unlucky enough to share the same
replication fetcher thread.

This seems to be caused by this logic: 
https://github.com/apache/kafka/blob/0.11.0.1/core/src/main/scala/kafka/log/ProducerStateManager.scala#L119

and might be fixed in the scope of 
https://issues.apache.org/jira/browse/KAFKA-5793.
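
To illustrate why an empty replica would hit this, here is a minimal sketch of
the kind of check the exception message points to. It is a toy model in Python,
not the Kafka code: ProducerStateModel, ProducerEntry, NO_EPOCH and
replicate_to_empty_follower are hypothetical names, and the rule "an epoch
this log has not seen must start at sequence 0" is an assumption inferred from
the error text above.

```python
from dataclasses import dataclass
from typing import Dict

NO_EPOCH = -1  # hypothetical marker: no state recorded for this producer yet


class OutOfOrderSequenceError(Exception):
    """Stand-in for org.apache.kafka.common.errors.OutOfOrderSequenceException."""


@dataclass
class ProducerEntry:
    epoch: int = NO_EPOCH
    last_seq: int = -1


class ProducerStateModel:
    """Toy per-partition producer state; not the real ProducerStateManager."""

    def __init__(self) -> None:
        self.entries: Dict[int, ProducerEntry] = {}

    def append(self, producer_id: int, epoch: int,
               first_seq: int, last_seq: int) -> None:
        entry = self.entries.get(producer_id, ProducerEntry())
        if epoch != entry.epoch:
            # Assumed rule: an epoch this log has not seen must start at 0.
            if first_seq != 0:
                raise OutOfOrderSequenceError(
                    f"Invalid sequence number for new epoch: {epoch} "
                    f"(request epoch), {first_seq} (seq. number)")
        elif first_seq != entry.last_seq + 1:
            # Same epoch: the batch must continue right after the last one.
            raise OutOfOrderSequenceError(
                f"Expected sequence {entry.last_seq + 1}, got {first_seq}")
        self.entries[producer_id] = ProducerEntry(epoch, last_seq)


def replicate_to_empty_follower(first_seq: int) -> str:
    """Append one 10-record batch to a replica that has no producer state."""
    follower = ProducerStateModel()
    try:
        follower.append(producer_id=1, epoch=0,
                        first_seq=first_seq, last_seq=first_seq + 9)
    except OutOfOrderSequenceError as err:
        return str(err)
    return "ok"
```

In this model a replica that starts from the beginning of the producer's
history (first_seq=0) accepts the batch, while a replica with no data that
starts fetching after the early segments were deleted on the leader
(first_seq=154277489) is rejected with the same message as in the log above.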

However, any hints on how to get those partitions to a fully replicated state
are highly appreciated.
Any hints on how to get this broker 

  was:
If a partition of a topic with idempotent producer has no data on 1 of the 
brokers, but it does exist on others and some of the segments for this 
partition have been already deleted replication thread responsible for this 
partition on the broker which has no data for it fails to start with out of 
order sequence exception:
{code}
[2017-10-02 09:44:23,825] ERROR [ReplicaFetcherThread-2-4]: Error due to 
(kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: error processing data for partition 
[stage.data.adevents.v2,20] offset 1660336429
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:203)
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:174)
        at scala.Option.foreach(Option.scala:257)
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:174)
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:171)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:171)
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
        at 
kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
        at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:169)
        at 
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
Caused by: org.apache.kafka.common.errors.OutOfOrderSequenceException: Invalid 
sequence number for new epoch: 0 (request epoch), 154277489 (seq. number)
{code}
We run kafka 0.11.0.1 and we ran into the situation when 1 of replication 
threads was stopped for few days, while everything else on that broker was 
functional. This is our staging cluster and retention is less than a day, so 
everything for partitions for which replication thread was down was cleaned up. 
At the moment we have a broker which cannot start replication for few 
partitions. I was also able to reproduce in my local test environment.
Another possible use case is disk failure or any situation when previously 
deleting all the data for the partition on a broker helped - since it would 
just fetch all the data from other replicas. Now it does not work for topics 
with idempotent producers. It might also affect other not-idempotent topics if 
those are unlucky to share same replication fetcher thread. 

This seems to be caused by this logic: 
https://github.com/apache/kafka/blob/0.11.0.1/core/src/main/scala/kafka/log/ProducerStateManager.scala#L119

and might be fixed in the scope of 
https://issues.apache.org/jira/browse/KAFKA-5793.

However any hints on how to get those partition to fully replicated state are 
highly appreciated.
Any hints on how to get this broker 


> Replication Fetcher thread for a partition with no data fails to start
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-6003
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6003
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.11.0.1
>            Reporter: Stanislav Chizhov



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
