[
https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337705#comment-17337705
]
Ismael Juma edited comment on KAFKA-7635 at 7/14/21, 1:09 PM:
--------------------------------------------------------------
This bug has been fixed by KIP-461:
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-461+-+Improve+Replica+Fetcher+behavior+at+handling+partition+failure]
github commit:
[https://github.com/confluentinc/ce-kafka/commit/414852c701763b6f8362b44d156753b6c3ef247a#]
Earliest available release:
[https://github.com/confluentinc/kafka/releases/tag/2.3.1|https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1]
[https://github.com/confluentinc/kafka/releases/tag/2.3.1-rc2|https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1-rc2]
was (Author: yding):
This bug has been fixed by KIP-461:
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-461+-+Improve+Replica+Fetcher+behavior+at+handling+partition+failure]
github commit:
[https://github.com/confluentinc/ce-kafka/commit/414852c701763b6f8362b44d156753b6c3ef247a#]
Earliest available release:
[https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1]
[https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1-rc2]
> FetcherThread stops processing after "Error processing data for partition"
> --------------------------------------------------------------------------
>
> Key: KAFKA-7635
> URL: https://issues.apache.org/jira/browse/KAFKA-7635
> Project: Kafka
> Issue Type: Bug
> Components: replication
> Affects Versions: 2.0.0
> Reporter: Steven Aerts
> Priority: Major
> Attachments: stacktraces.txt
>
>
> After disabling unclean leader election again, following recovery from a
> situation where we had enabled it because of a split brain in ZooKeeper, we
> saw that some of our brokers stopped replicating their partitions.
> Digging into the logs, we saw that the replica fetcher thread had stopped
> because one partition had a failure which threw an [{{Error processing data for
> partition}}
> exception|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L207].
> The broker, however, kept running and serving the partitions for which it was
> the leader.
> We saw three different types of exceptions triggering this (example
> stacktraces attached):
> * {{kafka.common.UnexpectedAppendOffsetException}}
> * {{Trying to roll a new log segment for topic partition partition-b-97 with
> start offset 1388 while it already exists.}}
> * {{Kafka scheduler is not running.}}
> We think there are two acceptable ways for the Kafka broker to handle this:
> * Mark those partitions as failed and handle them accordingly, as is done
> [when a {{CorruptRecordException}} or
> {{KafkaStorageException}}|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L196]
> is thrown.
> * Exit the broker as is done [when log truncation is not
> allowed|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/ReplicaFetcherThread.scala#L189].
>
> Maybe even a combination of both. Our probably naive idea is that for the
> first two types the first strategy would be the best, but for the last type,
> it is probably better to re-throw a {{FatalExitError}} and exit the broker.
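
The combination proposed above can be sketched roughly as follows. This is only an illustration of the idea, not Kafka's actual fetcher code and not the KIP-461 implementation: the class PartitionIsolatingFetcher, the exception FatalFetcherException (a stand-in for the role {{FatalExitError}} plays in the description) and the use of plain strings for partitions are all hypothetical. A recoverable per-partition error marks only that partition as failed, so the loop keeps serving the others, while a fatal error is re-thrown to the caller.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: isolates per-partition failures instead of letting
// one bad partition stop the whole fetcher loop.
class PartitionIsolatingFetcher {

    // Hypothetical stand-in for an unrecoverable condition, for example
    // "Kafka scheduler is not running."
    static class FatalFetcherException extends RuntimeException {
        FatalFetcherException(String message) {
            super(message);
        }
    }

    private final Set<String> failedPartitions = new LinkedHashSet<>();

    // Runs one append action per partition. The map key is a partition name,
    // the value is the (hypothetical) work of appending fetched data to it.
    void processFetched(Map<String, Runnable> appendsByPartition) {
        for (Map.Entry<String, Runnable> entry : appendsByPartition.entrySet()) {
            String partition = entry.getKey();
            if (failedPartitions.contains(partition)) {
                continue; // already marked as failed, skip it
            }
            try {
                entry.getValue().run();
            } catch (FatalFetcherException e) {
                // Second strategy: propagate so the caller can exit the broker.
                throw e;
            } catch (RuntimeException e) {
                // First strategy: mark only this partition as failed and
                // continue with the remaining partitions.
                failedPartitions.add(partition);
            }
        }
    }

    // Read-only view of the partitions that have been marked as failed.
    Set<String> failedPartitions() {
        return Collections.unmodifiableSet(failedPartitions);
    }
}
```

In this sketch a failed partition stays excluded until something removes it from the set; how and when a real broker should retry or report such partitions is outside what the description specifies.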