Alexey Ozeritskiy created KAFKA-2165:
----------------------------------------
Summary: ReplicaFetcherThread: data loss on unknown exception
Key: KAFKA-2165
URL: https://issues.apache.org/jira/browse/KAFKA-2165
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8.2.1
Reporter: Alexey Ozeritskiy
Occasionally a replica in our cluster drops out of the ISR. The broker then
re-downloads the partition from the beginning. We found the following messages
in the logs:
{code}
# The leader:
[2015-03-25 11:11:07,796] ERROR [Replica Manager on Broker 21]: Error when
processing fetch request for partition [topic,11] offset 54369274 from follower
with correlation id 2634499. Possible cause: Request for offset 54369274 but we
only have log segments in the range 49322124 to 54369273.
(kafka.server.ReplicaManager)
{code}
{code}
# The follower:
[2015-03-25 11:11:08,816] WARN [ReplicaFetcherThread-0-21], Replica 31 for
partition [topic,11] reset its fetch offset from 49322124 to current leader
21's start offset 49322124 (kafka.server.ReplicaFetcherThread)
[2015-03-25 11:11:08,816] ERROR [ReplicaFetcherThread-0-21], Current offset
54369274 for partition [topic,11] out of range; reset offset to 49322124
(kafka.server.ReplicaFetcherThread)
{code}
This occurs because we update fetchOffset
[here|https://github.com/apache/kafka/blob/0.8.2/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L124]
and only then try to process the message.
If any exception other than OffsetOutOfRangeCode occurs during processing,
fetchOffset and replica.logEndOffset are left out of sync.
On the next fetch iteration we can then get
fetchOffset>replica.logEndOffset==leaderEndOffset and an OffsetOutOfRangeCode
error. The follower then resets its fetch offset to the leader's start offset,
as in the follower log above, and re-downloads the partition.
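
The ordering problem can be illustrated with a small toy model. This is only a sketch, not Kafka code: the class ToyFollower, its methods, and the offsets used are all hypothetical. It assumes that processing a fetch response means appending the messages to the local log, and that the append fails with some unexpected error. The second handler shows one possible reordering for comparison; it is not necessarily the fix the project would adopt.

```python
class ToyFollower:
    """Hypothetical stand-in for a follower replica; not Kafka code."""

    def __init__(self, log_end_offset):
        self.log_end_offset = log_end_offset  # next offset the local log expects
        self.fetch_offset = log_end_offset    # next offset to request from the leader

    def append(self, messages, fail=False):
        # Simulate processing a fetch response; `fail` injects an unexpected error.
        if fail:
            raise RuntimeError("unexpected error while appending")
        self.log_end_offset += len(messages)

    def handle_response_update_first(self, messages, fail=False):
        # Ordering described in this report: advance fetch_offset, then process.
        self.fetch_offset += len(messages)
        self.append(messages, fail)

    def handle_response_process_first(self, messages, fail=False):
        # Alternative ordering: advance fetch_offset only after a successful append.
        self.append(messages, fail)
        self.fetch_offset = self.log_end_offset


def run(handler_name):
    follower = ToyFollower(log_end_offset=100)
    handler = getattr(follower, handler_name)
    try:
        handler(["m100", "m101", "m102"], fail=True)
    except RuntimeError:
        pass  # swallow the injected error so the offsets can be inspected
    return follower.fetch_offset, follower.log_end_offset


print(run("handle_response_update_first"))   # (103, 100): offsets diverge
print(run("handle_response_process_first"))  # (100, 100): offsets stay in sync
```

In the first case the toy follower would next ask for offset 103 although its log ends at 100, which corresponds to the fetchOffset>replica.logEndOffset state described above.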