[ https://issues.apache.org/jira/browse/KAFKA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson resolved KAFKA-7786.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.1
                   2.2.0

Fast update of leader epoch may stall partition fetching due to FENCED_LEADER_EPOCH
------------------------------------------------------------------------------------

                 Key: KAFKA-7786
                 URL: https://issues.apache.org/jira/browse/KAFKA-7786
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: Anna Povzner
            Assignee: Anna Povzner
            Priority: Critical
             Fix For: 2.2.0, 2.1.1

KIP-320/KAFKA-7395 added the FENCED_LEADER_EPOCH error response to an OffsetsForLeaderEpoch request whose epoch is lower than the broker's current leader epoch. ReplicaFetcherThread builds the OffsetsForLeaderEpoch request under _partitionMapLock_, sends the request outside the lock, and then processes the response under _partitionMapLock_. In the meantime, the broker may receive a LeaderAndIsr request with the same leader but the next leader epoch, and remove and re-add the partition to the fetcher thread (with the partition state reflecting the updated leader epoch), all while the OffsetsForLeaderEpoch request (with the old leader epoch) is still outstanding or waiting for the lock so its response can be processed. As a result, the partition gets removed from partitionStates and the broker will not fetch for this partition until the next LeaderAndIsr, which may take a while. The symptom is a log message like this:

    [2018-12-23 07:23:04,802] INFO [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Partition test_topic-17 has an older epoch (7) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)

We saw this happen with kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True. This test does partition re-assignment while bouncing 2 out of 4 brokers. When the failure happened, each bounced broker was also a controller. Because of the re-assignment, the controller updates the leader epoch without updating the leader on controller change or on broker startup, so we see several leader epoch changes without a leader change, which increases the likelihood of the race condition described above.
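The window is easiest to see in a stripped-down model of the fetcher's locking pattern. The sketch below is illustrative only; the object, method, and field names are hypothetical and stand in for the real ReplicaFetcherThread / AbstractFetcherThread code. It shows the request being built under the lock with epoch 6, the partition being removed and re-added with epoch 7 while the request is in flight, and the pre-fix response handling then dropping the partition on FENCED_LEADER_EPOCH:

{code:scala}
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable

// Simplified model of the fetcher's locking pattern (hypothetical names,
// not the real ReplicaFetcherThread API).
object EpochRaceSketch {

  final case class PartitionState(fetchOffset: Long, leaderEpoch: Int)

  private val partitionMapLock = new ReentrantLock()
  // Partitions this fetcher is responsible for, keyed by topic-partition.
  private val partitionStates = mutable.Map.empty[String, PartitionState]

  private def inLock[T](body: => T): T = {
    partitionMapLock.lock()
    try body finally partitionMapLock.unlock()
  }

  // Step 1: build the OffsetsForLeaderEpoch request under the lock.
  def buildEpochRequest(): Map[String, Int] = inLock {
    partitionStates.map { case (tp, state) => tp -> state.leaderEpoch }.toMap
  }

  // Step 2 happens outside the lock: the request is sent to the leader.
  // While it is in flight, a LeaderAndIsr with a newer epoch can call
  // removeAndAddPartition() below and bump the epoch in partitionStates.
  def removeAndAddPartition(tp: String, newEpoch: Int): Unit = inLock {
    partitionStates.remove(tp)
    partitionStates.put(tp, PartitionState(fetchOffset = 0L, leaderEpoch = newEpoch))
  }

  // Step 3: process the response under the lock. This is the pre-fix
  // behaviour: FENCED_LEADER_EPOCH drops the partition outright, even
  // though the fencing only applies to the stale epoch that was sent.
  def handleEpochResponse(tp: String, fenced: Boolean): Unit = inLock {
    if (fenced) {
      partitionStates.remove(tp)
      println(s"$tp removed from partitionStates; awaiting new LeaderAndIsr")
    }
  }

  def main(args: Array[String]): Unit = {
    partitionStates.put("test_topic-17", PartitionState(1388L, leaderEpoch = 6))
    val sent = buildEpochRequest()                // request built with epoch 6
    removeAndAddPartition("test_topic-17", 7)     // epoch bumped to 7 while the request is in flight
    // The leader already moved to epoch 7, so it fences the epoch-6 request.
    handleEpochResponse("test_topic-17", fenced = sent("test_topic-17") < 7)
    println(s"partitionStates now: $partitionStates") // empty: the partition no longer fetches
  }
}
{code}

The sequential main() only stands in for the interleaving; in the real broker the LeaderAndIsr path and the fetcher thread race on _partitionMapLock_.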
Here are the exact events that happened in this test (around the failure). We have 4 brokers: 1, 2, 3, 4. Partition re-assignment is started for test_topic-17 [2, 4, 1] -> [3, 1, 2]. At time t0, the leader of test_topic-17 is broker 2.

# Clean shutdown of broker 3, which is also the controller.
# Broker 4 becomes controller, continues the re-assignment, and updates the leader epoch for test_topic-17 to 6 (with the same leader).
# Broker 2 (leader of test_topic-17) receives the new leader epoch: "test_topic-17 starts at Leader Epoch 6 from offset 1388. Previous Leader Epoch was: 5"
# Broker 3 is started again after the clean shutdown.
# The controller sees broker 3 start up and sends LeaderAndIsr (leader epoch 6) to broker 3.
# The controller updates the leader epoch to 7.
# Broker 2 (leader of test_topic-17) receives LeaderAndIsr for leader epoch 7: "test_topic-17 starts at Leader Epoch 7 from offset 1974. Previous Leader Epoch was: 6"
# Broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 6 from the controller ("Added fetcher to broker BrokerEndPoint(id=2) for leader epoch 6") and sends an OffsetsForLeaderEpoch request to broker 2.
# Broker 3 receives LeaderAndIsr for test_topic-17 and leader epoch 7 from the controller; it removes and re-adds the fetcher thread, executing AbstractFetcherThread.addPartitions(), which updates the partition state with leader epoch 7.
# Broker 3 receives FENCED_LEADER_EPOCH in response to OffsetsForLeaderEpoch(leader epoch 6), because the leader received LeaderAndIsr for leader epoch 7 before it got OffsetsForLeaderEpoch(leader epoch 6) from broker 3. As a result, broker 3 removes the partition from partitionStates and does not fetch until the controller updates the leader epoch and sends LeaderAndIsr for this partition to broker 3. The test fails because the re-assignment does not finish on time (due to broker 3 not fetching).

One way to address this would be to add more state to PartitionFetchState, but that risks introducing other race conditions. A cleaner way, I think, is to return the leader epoch in the OffsetsForLeaderEpoch response together with the FENCED_LEADER_EPOCH error, and then ignore the error if the partition state contains a higher leader epoch (see the sketch below). The advantage is less state to maintain; the disadvantage is that it requires bumping the inter-broker protocol.
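For illustration, here is a minimal sketch of that proposal. The types are hypothetical stand-ins for the real protocol classes, the response schema change and inter-broker protocol bump are not shown, and this is not necessarily the change that ultimately shipped in 2.1.1/2.2.0. The idea: the response carries the leader's epoch alongside FENCED_LEADER_EPOCH, and the follower ignores the error when its partition state already holds a higher epoch than the one it sent:

{code:scala}
import scala.collection.mutable

// Sketch of the proposed handling (hypothetical types; the real response
// class and error codes live in the Kafka protocol layer).
object FencedEpochHandlingSketch {

  sealed trait EpochError
  case object NoError           extends EpochError
  case object FencedLeaderEpoch extends EpochError

  // Proposed shape: the response carries the leader's current epoch even
  // when the request is fenced.
  final case class EpochEndOffsetResponse(error: EpochError,
                                          leaderEpoch: Int,
                                          endOffset: Long)

  final case class PartitionState(fetchOffset: Long, leaderEpoch: Int)

  private val partitionStates = mutable.Map.empty[String, PartitionState]

  // Only treat FENCED_LEADER_EPOCH as fatal for the partition if the local
  // state is not already ahead of the epoch that was fenced. If a newer
  // LeaderAndIsr raced in and bumped the local epoch, the fenced response
  // refers to a stale request and is ignored; the next round retries with
  // the current epoch.
  def handleResponse(tp: String, requestedEpoch: Int, resp: EpochEndOffsetResponse): Unit = {
    resp.error match {
      case FencedLeaderEpoch =>
        val current = partitionStates.get(tp).map(_.leaderEpoch)
        if (current.exists(_ > requestedEpoch)) {
          println(s"$tp: ignoring FENCED_LEADER_EPOCH for stale epoch $requestedEpoch; " +
            s"local epoch is now ${current.get}, leader reports ${resp.leaderEpoch}")
        } else {
          // Genuinely fenced: stop fetching until the next LeaderAndIsr.
          partitionStates.remove(tp)
        }
      case NoError =>
        // Normal truncation / fetch-offset handling would go here.
        ()
    }
  }

  def main(args: Array[String]): Unit = {
    // Replay the race from the timeline above: request sent with epoch 6,
    // local state already bumped to 7, leader responds fenced with epoch 7.
    partitionStates.put("test_topic-17", PartitionState(1974L, leaderEpoch = 7))
    handleResponse("test_topic-17", requestedEpoch = 6,
      EpochEndOffsetResponse(FencedLeaderEpoch, leaderEpoch = 7, endOffset = -1L))
    println(s"partitionStates still contains: ${partitionStates.keySet}")
  }
}
{code}

Comparing against the epoch that was actually sent keeps the check local to the fetcher's response handling and avoids the extra PartitionFetchState bookkeeping mentioned above.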