[ 
https://issues.apache.org/jira/browse/KAFKA-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-7415:
-----------------------------------
    Fix Version/s: 1.1.2

> OffsetsForLeaderEpoch may incorrectly respond with undefined epoch causing 
> truncation to HW
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7415
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7415
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 2.0.0
>            Reporter: Anna Povzner
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 1.1.2, 2.0.1, 2.1.0
>
>
> If the follower's last appended epoch is ahead of the leader's last appended 
> epoch, the OffsetsForLeaderEpoch response incorrectly returns 
> (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET), and the follower truncates to its 
> HW. This can lead to data loss in some rare cases where two back-to-back 
> leader elections happen (failure of one leader, followed by quick re-election 
> of the next leader due to preferred leader election, so that all replicas are 
> still in the ISR, and then failure of the third leader).
> The bug is in LeaderEpochFileCache.endOffsetFor(), which returns 
> (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET) if the requested leader epoch is 
> ahead of the last leader epoch in the cache. The method should return (last 
> leader epoch in the cache, LEO) in this scenario.
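> For illustration, here is a minimal, self-contained sketch of this logic (not 
> the actual LeaderEpochFileCache code; the cache is modeled here as a list of 
> (epoch, startOffset) pairs sorted by epoch, and the LEO is passed in 
> explicitly):
> {code:scala}
> object EpochCacheSketch {
>   val UndefinedEpoch = -1
>   val UndefinedEpochOffset = -1L
> 
>   // Current (buggy) behavior, simplified: if the requested epoch is not the
>   // latest cached epoch and there is no cached epoch above it (in particular,
>   // when it is ahead of every cached epoch), fall through to the sentinels.
>   def endOffsetForBuggy(cache: List[(Int, Long)], requestedEpoch: Int, leo: Long): (Int, Long) =
>     cache.lastOption match {
>       case Some((lastEpoch, _)) if requestedEpoch == lastEpoch =>
>         (requestedEpoch, leo)
>       case _ =>
>         val (subsequent, previous) = cache.partition { case (e, _) => e > requestedEpoch }
>         if (subsequent.isEmpty || previous.isEmpty) (UndefinedEpoch, UndefinedEpochOffset)
>         else (previous.last._1, subsequent.head._2)
>     }
> 
>   // Proposed behavior: a requested epoch that is ahead of the last cached
>   // epoch is answered with (last cached epoch, LEO) instead of the sentinels.
>   def endOffsetForFixed(cache: List[(Int, Long)], requestedEpoch: Int, leo: Long): (Int, Long) =
>     cache.lastOption match {
>       case Some((lastEpoch, _)) if requestedEpoch > lastEpoch => (lastEpoch, leo)
>       case _ => endOffsetForBuggy(cache, requestedEpoch, leo)
>     }
> }
> {code}
> In the scenario below, assuming Replica 3's cache holds only (epoch 0, start 
> offset 0) and its LEO is 10, endOffsetForBuggy(..., requestedEpoch = 1, leo = 10) 
> returns (-1, -1), while endOffsetForFixed returns (0, 10).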
> We don't create an entry in the leader epoch cache until a message is 
> appended with the new leader epoch; every append to the log calls 
> LeaderEpochFileCache.assign(). It would be much cleaner if `makeLeader` 
> created an entry in the cache as soon as the replica becomes a leader, which 
> would also fix the bug. For the case where a leader never appends any 
> messages and the next leader epoch therefore starts at the same offset, we 
> already have clearAndFlushLatest(), which clears entries with start offsets 
> greater than or equal to the passed offset. LeaderEpochFileCache.assign() 
> could be merged with clearAndFlushLatest(), so that assigning a new epoch 
> also clears cache entries with start offsets greater than or equal to the new 
> epoch's start offset, and the two methods no longer need to be called 
> separately. 
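> A rough sketch of that merged operation (illustrative only; the checkpoint 
> file flush and locking are omitted, and the cache is again modeled as a list 
> of (epoch, startOffset) pairs):
> {code:scala}
> object EpochAssignSketch {
>   // Merged assign(): first drop entries whose start offset is >= the new
>   // epoch's start offset (what clearAndFlushLatest() does today), then append
>   // the new entry. If makeLeader called this with the new leader epoch and
>   // the current LEO, the cache would get an entry even before the first
>   // append under that epoch.
>   def assignMerged(cache: List[(Int, Long)], newEpoch: Int, startOffset: Long): List[(Int, Long)] = {
>     val retained = cache.filter { case (_, start) => start < startOffset }
>     retained :+ (newEpoch -> startOffset)
>   }
> }
> {code}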
>  
> Here is an example of a scenario where the issue leads to data loss.
> Suppose we have three replicas: r1, r2, and r3. Initially, the ISR consists 
> of (r1, r2, r3) and the leader is r1. The data up to offset 10 has been 
> committed to the ISR. Here is the initial state:
> {code:java}
> Leader: r1
> leader epoch: 0
> ISR(r1, r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=10]
> r3: [hw=5, leo=10]
> {code}
> Replica 1 fails and leaves the ISR, which makes Replica 2 the new leader with 
> leader epoch 1. The new leader appends a batch, but it has not yet been 
> replicated to the followers.
> {code:java}
> Leader: r2
> leader epoch: 1
> ISR(r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=11]
> r3: [hw=5, leo=10]
> {code}
> Replica 3 is elected leader with leader epoch 2 (due to preferred leader 
> election) before it has a chance to truncate. 
> {code:java}
> Leader: r3
> leader epoch: 2
> ISR(r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=11]
> r3: [hw=5, leo=10]
> {code}
> Replica 2 sends OffsetsForLeaderEpoch(leader epoch = 1) to Replica 3. Because 
> epoch 1 is ahead of the last epoch in Replica 3's cache, Replica 3 incorrectly 
> replies with UNDEFINED_EPOCH_OFFSET, and Replica 2 truncates to its HW 
> (offset 8), discarding committed records above it. If Replica 3 fails before 
> Replica 2 re-fetches the data, those committed records are lost.
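> For completeness, a simplified sketch of the follower-side decision in this 
> scenario (a rough approximation of the fetcher's truncation fallback, not the 
> actual ReplicaFetcherThread code):
> {code:scala}
> object FollowerTruncationSketch {
>   val UndefinedEpochOffset = -1L
> 
>   // Roughly how the follower picks its truncation point from the
>   // OffsetsForLeaderEpoch response: an undefined end offset forces the
>   // fallback to the high watermark; a defined one allows truncating to
>   // min(leader end offset, own LEO).
>   def truncationOffset(leaderEndOffset: Long, followerLeo: Long, followerHw: Long): Long =
>     if (leaderEndOffset == UndefinedEpochOffset) followerHw
>     else math.min(leaderEndOffset, followerLeo)
> }
> {code}
> With the numbers above (Replica 2: hw=8, leo=11), the buggy reply truncates 
> Replica 2 to offset 8, while a reply of (0, 10) from the fixed endOffsetFor 
> would truncate it only to offset 10, dropping just the unreplicated epoch-1 
> batch.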



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
