[ 
https://issues.apache.org/jira/browse/KAFKA-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-10487:
------------------------------------
    Description: 
Consider the following scenario:

Three replicas: A, B, and C. In epoch=1, replica A is the leader and writes up 
to offset 10. The leader then fails with the high watermark at offset 8. 
Replica B had caught up to offset 10 while replica C was at offset 8. Suppose 
that C is elected with epoch=2 and immediately writes records up to offset 10. 
However, C also fails before these records are committed, and replica B is 
elected with epoch=3 and writes records up to offset 12. The epoch cache on 
each replica will look like the following:

Replica A:
(epoch=1, start_offset=0)

Replica B:
(epoch=1, start_offset=0)
(epoch=3, start_offset=10)

Replica C:
(epoch=1, start_offset=0)
(epoch=2, start_offset=8)

Suppose C comes back online. It will attempt to fetch at offset=10 with 
last_fetched_epoch=2. The leader B will detect log divergence and will return 
truncation_offset=10. Replica C will truncate to offset 10 (a no-op) and retry 
the same fetch, so it never makes progress.
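
The stuck fetch can be reproduced with a minimal sketch. It is illustrative 
only: the helper end_offset_for_epoch and the list-of-tuples cache layout are 
hypothetical, not Kafka's actual classes. It assumes the leader answers with 
the end offset of the largest epoch it knows that is at or below the 
follower's last fetched epoch, and it takes C's last fetched epoch to be 2, 
the latest entry in C's epoch cache.

```python
from typing import List, Tuple

# Hypothetical cache layout: (epoch, start_offset) pairs, ascending by epoch.
EpochCache = List[Tuple[int, int]]


def end_offset_for_epoch(cache: EpochCache, log_end_offset: int, epoch: int) -> Tuple[int, int]:
    """Return (epoch, end_offset) for the largest cached epoch <= `epoch`."""
    result = (-1, 0)
    for i, (cached_epoch, _start) in enumerate(cache):
        if cached_epoch > epoch:
            break
        # An epoch ends where the next one starts, or at the log end offset.
        end = cache[i + 1][1] if i + 1 < len(cache) else log_end_offset
        result = (cached_epoch, end)
    return result


leader_b = [(1, 0), (3, 10)]  # leader B's epoch cache, log end offset 12
fetch_offset = 10             # C fetches at offset 10

# B has no epoch 2, so it falls back to epoch 1, which ends at offset 10.
_, truncation_offset = end_offset_for_epoch(leader_b, 12, 2)

# C truncates to the returned offset and retries from there.
new_fetch_offset = min(fetch_offset, truncation_offset)
print(truncation_offset, new_fetch_offset)
```

Because the new fetch offset equals the old one, the next request is identical 
to the previous one.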

To fix this, I see two options:

Option 1: In the case that the truncation offset equals the fetch offset, we 
can instead return the previous epoch end offset. In this example, we would 
return truncation_offset=0. The downside is that this causes unnecessary 
truncation.

Option 2: Rather than returning only the truncation offset, we can have the 
leader return both the previous "diverging" epoch and its end offset. In this 
example, B would return diverging_epoch=1, end_offset=10. Since C's own log 
for epoch 1 ends at offset 8, replica C would then know to truncate to 
offset 8.
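
A rough sketch of how a follower might apply Option 2 is shown here. The 
names truncation_point and end_offset_for_epoch are hypothetical, and the rule 
used (truncate to the smaller of the leader's end offset and the follower's 
own end offset for the diverging epoch) is an assumption drawn from the 
example above rather than the final implementation.

```python
from typing import List, Tuple

# Hypothetical cache layout: (epoch, start_offset) pairs, ascending by epoch.
EpochCache = List[Tuple[int, int]]


def end_offset_for_epoch(cache: EpochCache, log_end_offset: int, epoch: int) -> Tuple[int, int]:
    """Return (epoch, end_offset) for the largest cached epoch <= `epoch`."""
    result = (-1, 0)
    for i, (cached_epoch, _start) in enumerate(cache):
        if cached_epoch > epoch:
            break
        # An epoch ends where the next one starts, or at the log end offset.
        end = cache[i + 1][1] if i + 1 < len(cache) else log_end_offset
        result = (cached_epoch, end)
    return result


def truncation_point(follower_cache: EpochCache, follower_log_end: int,
                     diverging_epoch: int, leader_end_offset: int) -> int:
    """Offset the follower truncates to, given the leader's diverging epoch."""
    _, local_end = end_offset_for_epoch(follower_cache, follower_log_end, diverging_epoch)
    # Records past either end offset cannot match the leader's log.
    return min(local_end, leader_end_offset)


replica_c = [(1, 0), (2, 8)]  # replica C's epoch cache, log end offset 10
new_end = truncation_point(replica_c, 10, diverging_epoch=1, leader_end_offset=10)
print(new_end)
```

With the values from the scenario, C's epoch 1 ends at offset 8, so the 
follower truncates to 8 and its next fetch can make progress.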

The second option is what was initially specified in the Raft proposal, but we 
changed it during the discussion because we had not considered this case and 
thought the response could be simplified. My inclination is to restore the 
originally specified truncation logic.

  was:
Consider the following scenario:

Three replicas: A, B, and C. In epoch=1, replica A is the leader and writes up 
to offset 10. The leader then fails with the high watermark at offset 8. 
Replica B had caught up to offset 10 while replica C was at offset 8. Suppose 
that C is elected with epoch=2 and immediately writes records up to offset 10. 
However, it also fails before these records become committed and replica B gets 
elected and writes records
up to offset 12. The epoch cache on each replica will look like the following:

Replica A:
(epoch=1, start_offset=0)

Replica B:
(epoch=1, start_offset=0)
(epoch=3, start_offset=10)

Replica C:
(epoch=1, start_offset=0)
(epoch=2, start_offset=8)

Suppose C comes back online. It will attempt to fetch at offset=10 with 
last_fetched_epoch=3. The leader B will detect log divergence
and will return truncation_offset=10. Replica C will truncate to offset 10 (a 
no-op) and retry the same fetch.

To fix this, I see two options:

Option 1: In the case that the truncation offset equals the fetch offset, we 
can instead return the previous epoch. In this example, we would return 
truncation_offset=0. The downside is that this causes unnecessary truncation.

Option 2: Rather than returning only the truncation offset, we can have the 
leader return both the previous "diverging" epoch and its end offset. In this 
example, B would return diverging_epoch=1, end_offset=10. Replica C would then 
know
to truncate to offset 8.

The second option is what was initially specified in the Raft proposal, but we 
changed during the discussion because we were not thinking of this case and we 
thought the response could be simplified. My inclination is to restore the 
originally specified truncation logic.


> Fix edge case in Raft truncation protocol
> -----------------------------------------
>
>                 Key: KAFKA-10487
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10487
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>
> Consider the following scenario:
> Three replicas: A, B, and C. In epoch=1, replica A is the leader and writes 
> up to offset 10. The leader then fails with the high watermark at offset 8. 
> Replica B had caught up to offset 10 while replica C was at offset 8. Suppose 
> that C is elected with epoch=2 and immediately writes records up to offset 
> 10. However, C also fails before these records are committed, and replica B 
> is elected with epoch=3 and writes records up to offset 12. The epoch cache 
> on each replica will look like the following:
> Replica A:
> (epoch=1, start_offset=0)
> Replica B:
> (epoch=1, start_offset=0)
> (epoch=3, start_offset=10)
> Replica C:
> (epoch=1, start_offset=0)
> (epoch=2, start_offset=8)
> Suppose C comes back online. It will attempt to fetch at offset=10 with 
> last_fetched_epoch=2. The leader B will detect log divergence and will return 
> truncation_offset=10. Replica C will truncate to offset 10 (a no-op) and 
> retry the same fetch, so it never makes progress.
> To fix this, I see two options:
> Option 1: In the case that the truncation offset equals the fetch offset, we 
> can instead return the previous epoch end offset. In this example, we would 
> return truncation_offset=0. The downside is that this causes unnecessary 
> truncation.
> Option 2: Rather than returning only the truncation offset, we can have the 
> leader return both the previous "diverging" epoch and its end offset. In this 
> example, B would return diverging_epoch=1, end_offset=10. Since C's own log 
> for epoch 1 ends at offset 8, replica C would then know to truncate to 
> offset 8.
> The second option is what was initially specified in the Raft proposal, but 
> we changed it during the discussion because we had not considered this case 
> and thought the response could be simplified. My inclination is to restore 
> the originally specified truncation logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
