[ 
https://issues.apache.org/jira/browse/KAFKA-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hzh0425 updated KAFKA-16073:
----------------------------
    Description: 
The identified bug in Apache Kafka's tiered storage feature involves a delayed 
update of {{localLogStartOffset}} in the {{UnifiedLog.deleteSegments}} method, 
which impacts consumer fetch operations. When segments are deleted from the 
log's in-memory state, {{localLogStartOffset}} is not updated promptly. 
Concurrently, {{ReplicaManager.handleOffsetOutOfRangeError}} checks whether a 
consumer's fetch offset is less than {{localLogStartOffset}}: if it is, the 
fetch is served from remote (tiered) storage; if it is not, Kafka sends an 
{{OffsetOutOfRangeException}} to the consumer.

In a specific concurrent scenario, imagine sequential offsets 
{{offset1 < offset2 < offset3}}. A client requests data at {{offset2}} while a 
background deletion process has already removed segments from memory but has 
not yet advanced {{localLogStartOffset}} from {{offset1}} to {{offset3}}. When 
the fetch offset ({{offset2}}) is evaluated against the stale {{offset1}} in 
{{ReplicaManager.handleOffsetOutOfRangeError}}, it is not less than 
{{localLogStartOffset}}, so the broker incorrectly returns an 
{{OffsetOutOfRangeException}} instead of serving the read from remote storage. 
The issue arises from the out-of-sync update of {{localLogStartOffset}}, 
leading to incorrect handling of consumer fetch requests and potential data 
access errors.
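
A minimal, self-contained sketch of this interleaving is below. The names here 
({{DelayedOffsetUpdateRace}}, {{handleFetch}}, {{deleteSegments}}) are 
simplified stand-ins rather than the actual Kafka classes and methods; only the 
ordering "delete segments from memory first, advance {{localLogStartOffset}} 
later" and the "fetch offset < {{localLogStartOffset}}" check mirror the 
behaviour described above.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class DelayedOffsetUpdateRace {

    // Stand-in for UnifiedLog.localLogStartOffset; starts at offset1.
    private final AtomicLong localLogStartOffset = new AtomicLong(100L);

    // Stand-in for the branch in ReplicaManager.handleOffsetOutOfRangeError:
    // fetch offsets below localLogStartOffset are routed to remote (tiered)
    // storage, anything else is treated as out of range for this log.
    String handleFetch(long fetchOffset) {
        if (fetchOffset < localLogStartOffset.get()) {
            return "read from remote storage";
        }
        return "OffsetOutOfRangeException";
    }

    // Stand-in for UnifiedLog.deleteSegments: segments leave the in-memory
    // state first, and only afterwards is localLogStartOffset advanced.
    void deleteSegments(long newLocalLogStartOffset) throws InterruptedException {
        // 1. segments removed from memory (not modelled here)
        Thread.sleep(50);                      // window where the offset is stale
        // 2. localLogStartOffset updated too late
        localLogStartOffset.set(newLocalLogStartOffset);
    }

    public static void main(String[] args) throws Exception {
        DelayedOffsetUpdateRace log = new DelayedOffsetUpdateRace();
        long offset2 = 150L;   // consumer's fetch offset
        long offset3 = 200L;   // local log start offset after deletion

        Thread deleter = new Thread(() -> {
            try {
                log.deleteSegments(offset3);
            } catch (InterruptedException ignored) { }
        });
        deleter.start();

        // Fetch arrives while deletion is in flight: offset2 (150) is not less
        // than the stale offset1 (100), so the consumer gets
        // OffsetOutOfRangeException instead of a remote read.
        System.out.println("during deletion: " + log.handleFetch(offset2));

        deleter.join();

        // Once localLogStartOffset reaches offset3 (200), the same fetch is
        // correctly routed to remote storage.
        System.out.println("after deletion:  " + log.handleFetch(offset2));
    }
}
{code}

Run as-is, the first fetch observes the stale offset and gets the erroneous 
{{OffsetOutOfRangeException}}; once the deleter thread finishes, the same 
fetch is correctly routed to remote storage.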

  was:
This bug pertains to Apache Kafka's tiered storage functionality. 
Specifically, it involves a timing issue in the {{UnifiedLog.deleteSegments}} 
method. The method first deletes segments from memory but delays updating 
{{localLogStartOffset}}. Meanwhile, in 
{{ReplicaManager.handleOffsetOutOfRangeError}}, if the fetch offset is less 
than {{localLogStartOffset}}, it triggers the remote read process. However, if 
it is greater, an {{OffsetOutOfRangeException}} is sent to the client.

Consider a scenario with concurrent operations, where 
{{offset1 < offset2 < offset3}}. A client requests {{offset2}} while a 
background thread is deleting segments. The segments are deleted in memory, 
but {{localLogStartOffset}} is still at {{offset1}} and not yet updated to 
{{offset3}}. In this state, since {{offset2}} is greater than {{offset1}}, 
{{ReplicaManager.handleOffsetOutOfRangeError}} erroneously returns an 
{{OffsetOutOfRangeException}} to the client. This happens because the system 
has not yet recognized the new starting offset ({{offset3}}), leading to 
incorrect handling of fetch requests.
 
 
 
 


> Kafka Tiered Storage Bug: Consumer Fetch Error Due to Delayed 
> localLogStartOffset Update During Segment Deletion
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16073
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16073
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, Tiered-Storage
>    Affects Versions: 3.6.1
>            Reporter: hzh0425
>            Assignee: hzh0425
>            Priority: Major
>              Labels: KIP-405, kip-405, tiered-storage
>             Fix For: 3.6.1
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
