[ https://issues.apache.org/jira/browse/KAFKA-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on KAFKA-1097 started by Neha Narkhede.
> Race condition while reassigning low throughput partition leads to incorrect
> ISR information in zookeeper
> ----------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-1097
> URL: https://issues.apache.org/jira/browse/KAFKA-1097
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 0.8
> Reporter: Neha Narkhede
> Assignee: Neha Narkhede
> Priority: Critical
> Fix For: 0.8.1
>
> Attachments: KAFKA-1097.patch, KAFKA-1097_2013-10-29_10:49:45.patch,
> KAFKA-1097_2013-10-30_21:46:00.patch, KAFKA-1097_2013-10-31_10:37:29.patch,
> KAFKA-1097_2013-11-01_09:55:33.patch
>
>
> While moving partitions, the controller moves the old replicas through the
> following state changes:
> ONLINE -> OFFLINE -> NON_EXISTENT
> During the OFFLINE state change, the controller removes the old replica from
> the ISR, writes the updated ISR to ZooKeeper, and notifies the leader. Note
> that it does not notify the old replicas to stop fetching from the leader (to
> be fixed in KAFKA-1032). During the NON_EXISTENT state change, the controller
> does not write the updated ISR or replica list to ZooKeeper. Right after the
> NON_EXISTENT state change, the controller writes the new replica list to
> ZooKeeper, but does not update the ISR. So an old replica can send a fetch
> request after the OFFLINE state change, which lets the leader add it back to
> the ISR. The problem is that if no new data is coming in for the partition
> and the old replica is fully caught up, the leader cannot remove it from the
> ISR. That lets a nonexistent replica live in the ISR at least until new data
> comes in to the partition.
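
The sequence described above can be replayed in a small toy model. This is only
a sketch of the ordering problem, not Kafka's controller or broker code: the
Leader class, the zk dictionary standing in for ZooKeeper, the broker ids, and
the offsets are all hypothetical, and the rule that the leader shrinks the ISR
only for followers behind its log end offset is a simplifying assumption taken
from the description.

```python
# Toy replay of the reassignment race. All names here are hypothetical
# and do not correspond to Kafka's actual classes or methods.

class Leader:
    def __init__(self, broker_id, isr, log_end_offset, zk):
        self.broker_id = broker_id
        self.isr = set(isr)
        self.log_end_offset = log_end_offset
        self.zk = zk                      # dict standing in for ZooKeeper
        self.follower_offsets = {}

    def on_controller_isr_update(self, isr):
        # The controller notifies the leader of the ISR it wrote.
        self.isr = set(isr)

    def handle_fetch(self, replica_id, fetch_offset):
        # A caught-up follower that is not in the ISR gets added back.
        self.follower_offsets[replica_id] = fetch_offset
        if replica_id not in self.isr and fetch_offset >= self.log_end_offset:
            self.isr.add(replica_id)
            self.zk["isr"] = sorted(self.isr)

    def append(self, num_messages):
        self.log_end_offset += num_messages

    def shrink_isr(self):
        # Assumption: only followers behind the log end offset are dropped.
        lagging = {
            r for r in self.isr
            if r != self.broker_id
            and self.follower_offsets.get(r, 0) < self.log_end_offset
        }
        if lagging:
            self.isr -= lagging
            self.zk["isr"] = sorted(self.isr)


def simulate_race():
    # Partition on brokers 1 (leader) and 2; broker 2 is being moved to 3.
    zk = {"replicas": [1, 2], "isr": [1, 2]}
    leader = Leader(broker_id=1, isr=[1, 2], log_end_offset=100, zk=zk)
    leader.follower_offsets[2] = 100      # old replica is fully caught up

    # OFFLINE: controller shrinks the ISR in zk and notifies the leader.
    zk["isr"] = [1]
    leader.on_controller_isr_update([1])

    # The old replica was never told to stop, so it fetches again.
    leader.handle_fetch(2, 100)

    # NON_EXISTENT writes nothing; afterwards only the replica list changes.
    zk["replicas"] = [1, 3]

    # No new data, so the caught-up old replica cannot be removed.
    leader.shrink_isr()
    stale_isr = list(zk["isr"])

    # New data arrives while the old replica no longer fetches.
    leader.append(10)
    leader.shrink_isr()
    return zk["replicas"], stale_isr, zk["isr"]


if __name__ == "__main__":
    print(simulate_race())
```

In this model the snapshot taken before new data arrives still lists broker 2
in the ISR even though it is no longer in the replica list, which is the
inconsistency reported here. The new replica (broker 3) joining the ISR is
deliberately left out to keep the sketch short.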
--
This message was sent by Atlassian JIRA
(v6.1#6144)