Jiangjie Qin created KAFKA-5453:
-----------------------------------

             Summary: Controller may miss requests sent to the broker when zk session timeout happens.
                 Key: KAFKA-5453
                 URL: https://issues.apache.org/jira/browse/KAFKA-5453
             Project: Kafka
          Issue Type: Bug
            Reporter: Jiangjie Qin

The issue I encountered was the following:
1. A partition reassignment was in progress: one replica of a partition was being reassigned from broker 1 to broker 2.
2. The controller received an ISR change notification indicating that broker 2 had caught up.
3. The controller was sending a StopReplicaRequest to broker 1.
4. Broker 1's zk session timed out. The controller removed broker 1 from the cluster and cleaned up the queue, i.e. the StopReplicaRequest was removed from the ControllerChannelManager.
5. Broker 1 reconnected to zk and acted as if it were still a follower replica of the partition.
6. Broker 1 will always receive an exception from the leader because it is not in the replica list.

Not sure what the correct fix is here. It seems that broker 1 in this case should ask the controller for the latest replica assignment.

There are two related bugs:
1. When a {{NotAssignedReplicaException}} is thrown from {{Partition.updateReplicaLogReadResult()}}, the other partitions in the same request fail to update their fetch timestamp and offset, and thus also drop out of the ISR.
2. The {{NotAssignedReplicaException}} was not properly returned to the replicas; instead, an UnknownServerException was returned.
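
To make the first related bug concrete, here is a minimal sketch in Java. It is not Kafka's code: the class FetchUpdateSketch, its methods, and the partition names are hypothetical, and the exception class is only a stand-in for {{NotAssignedReplicaException}}. It assumes the follower state for all partitions in a fetch request is updated in one loop, and contrasts a loop that aborts on the first exception with one that isolates the failure per partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of the failure mode; names and structure are illustrative only.
public class FetchUpdateSketch {

    // Stand-in for Kafka's NotAssignedReplicaException.
    static class NotAssignedReplicaException extends RuntimeException {
        NotAssignedReplicaException(String message) {
            super(message);
        }
    }

    // Records the follower's fetch state for one partition.
    // Throws if the follower is not an assigned replica of that partition.
    static void updateOne(String partition, int followerId,
                          Set<String> assignedPartitions, List<String> updated) {
        if (!assignedPartitions.contains(partition)) {
            throw new NotAssignedReplicaException(
                "broker " + followerId + " is not a replica of " + partition);
        }
        updated.add(partition);
    }

    // Problematic shape: the first exception ends the loop, so every
    // partition after the failing one is never updated.
    public static List<String> updateAbortOnError(List<String> partitions, int followerId,
                                                  Set<String> assignedPartitions) {
        List<String> updated = new ArrayList<>();
        try {
            for (String partition : partitions) {
                updateOne(partition, followerId, assignedPartitions, updated);
            }
        } catch (NotAssignedReplicaException e) {
            // The error is handled once for the whole request.
        }
        return updated;
    }

    // Isolating shape: a failure on one partition does not affect the others.
    public static List<String> updatePerPartition(List<String> partitions, int followerId,
                                                  Set<String> assignedPartitions) {
        List<String> updated = new ArrayList<>();
        for (String partition : partitions) {
            try {
                updateOne(partition, followerId, assignedPartitions, updated);
            } catch (NotAssignedReplicaException e) {
                // Only this partition is skipped; the loop continues.
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        List<String> requested = List.of("t-0", "t-1", "t-2");
        // Broker 1 has been reassigned away from t-0 but still fetches it.
        Set<String> assigned = Set.of("t-1", "t-2");

        System.out.println(updateAbortOnError(requested, 1, assigned)); // prints []
        System.out.println(updatePerPartition(requested, 1, assigned)); // prints [t-1, t-2]
    }
}
```

In the first variant, t-1 and t-2 never get their fetch state refreshed even though broker 1 is a valid replica of both, which is the mechanism by which they would also drop out of the ISR. Whether the actual fix should take the per-partition shape is a design question for the real code path.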