[ https://issues.apache.org/jira/browse/KAFKA-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hzh0425 updated KAFKA-16073:
----------------------------
    Description: 
The identified bug in Apache Kafka's tiered storage feature involves a
delayed update of {{localLogStartOffset}} in the
{{UnifiedLog.deleteSegments}} method, impacting consumer fetch operations.
When segments are deleted from the log's in-memory state,
{{localLogStartOffset}} isn't updated promptly. Concurrently,
{{ReplicaManager.handleOffsetOutOfRangeError}} checks whether a consumer's
fetch offset is less than {{localLogStartOffset}}. If it is, the fetch is
served by the remote read path; if it is not, Kafka sends an
{{OffsetOutOfRangeException}} to the consumer.

In a specific concurrent scenario, imagine sequential offsets:
{{offset1 < offset2 < offset3}}. A client requests data at {{offset2}}.
A background deletion process has removed segments from memory but hasn't yet
updated {{localLogStartOffset}} from {{offset1}} to {{offset3}}.
Consequently, when the fetch offset ({{offset2}}) is compared against the
stale {{offset1}} in {{ReplicaManager.handleOffsetOutOfRangeError}}, it is
not below the local start offset, so Kafka incorrectly returns an
{{OffsetOutOfRangeException}}. This issue arises from the out-of-sync update
of {{localLogStartOffset}}, leading to incorrect handling of consumer fetch
requests and potential data access errors.

  was:
This bug pertains to Apache Kafka's tiered storage functionality.
Specifically, it involves a timing issue in the
{{UnifiedLog.deleteSegments}} method. The method first deletes segments from
memory but delays updating {{localLogStartOffset}}. Meanwhile, in
{{ReplicaManager.handleOffsetOutOfRangeError}}, if the fetch offset is less
than {{localLogStartOffset}}, it triggers the read remote process. However,
if it's greater, an {{OffsetOutOfRangeException}} is sent to the client.

Consider a scenario with concurrent operations, where
{{offset1 < offset2 < offset3}}. A client requests {{offset2}} while a
background thread is deleting segments.
The segments are deleted in memory, but {{localLogStartOffset}} is still at
{{offset1}} and not yet updated to {{offset3}}. In this state, since
{{offset2}} is greater than {{offset1}},
{{ReplicaManager.handleOffsetOutOfRangeError}} erroneously returns an
{{OffsetOutOfRangeException}} to the client. This happens because the system
has not yet recognized the new starting offset ({{offset3}}), leading to
incorrect handling of fetch requests.


> Kafka Tiered Storage Bug: Consumer Fetch Error Due to Delayed
> localLogStartOffset Update During Segment Deletion
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16073
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16073
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, Tiered-Storage
>    Affects Versions: 3.6.1
>            Reporter: hzh0425
>            Assignee: hzh0425
>            Priority: Major
>              Labels: KIP-405, kip-405, tiered-storage
>             Fix For: 3.6.1
>
>
> The identified bug in Apache Kafka's tiered storage feature involves a
> delayed update of {{localLogStartOffset}} in the
> {{UnifiedLog.deleteSegments}} method, impacting consumer fetch operations.
> When segments are deleted from the log's in-memory state,
> {{localLogStartOffset}} isn't updated promptly. Concurrently,
> {{ReplicaManager.handleOffsetOutOfRangeError}} checks whether a consumer's
> fetch offset is less than {{localLogStartOffset}}. If it is, the fetch is
> served by the remote read path; if it is not, Kafka sends an
> {{OffsetOutOfRangeException}} to the consumer.
>
> In a specific concurrent scenario, imagine sequential offsets:
> {{offset1 < offset2 < offset3}}. A client requests data at {{offset2}}.
> A background deletion process has removed segments from memory but hasn't yet
> updated {{localLogStartOffset}} from {{offset1}} to {{offset3}}.
> Consequently, when the fetch offset ({{offset2}}) is compared against the
> stale {{offset1}} in {{ReplicaManager.handleOffsetOutOfRangeError}}, it is
> not below the local start offset, so Kafka incorrectly returns an
> {{OffsetOutOfRangeException}}. This issue arises from the out-of-sync update
> of {{localLogStartOffset}}, leading to incorrect handling of consumer fetch
> requests and potential data access errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
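
To make the interleaving in the description concrete, here is a minimal, single-threaded Java sketch that replays the three states one after another. It is only a toy model under stated assumptions. The class {{StaleStartOffsetSketch}}, its methods, the {{Outcome}} enum, and the concrete offsets 0, 50 and 100 (standing in for {{offset1}}, {{offset2}} and {{offset3}}) are all hypothetical. It is not the code of {{UnifiedLog}} or {{ReplicaManager}}, and the decision in {{fetch}} is a simplified stand-in for the check the report attributes to {{ReplicaManager.handleOffsetOutOfRangeError}}.

```java
import java.util.TreeSet;

// Toy model of the race described in this issue; NOT Kafka's actual code.
final class StaleStartOffsetSketch {
    enum Outcome { LOCAL_READ, REMOTE_READ, OFFSET_OUT_OF_RANGE }

    // Base offsets of the segments still present in memory (hypothetical model).
    private final TreeSet<Long> localSegmentBaseOffsets = new TreeSet<>();
    private final long logEndOffset;
    private long localLogStartOffset;

    StaleStartOffsetSketch(long logEndOffset, long... baseOffsets) {
        this.logEndOffset = logEndOffset;
        for (long base : baseOffsets) {
            localSegmentBaseOffsets.add(base);
        }
        this.localLogStartOffset = localSegmentBaseOffsets.first();
    }

    // Deletion step 1: drop segments from memory; localLogStartOffset is left untouched.
    void removeSegmentsBelow(long newStartOffset) {
        localSegmentBaseOffsets.headSet(newStartOffset).clear();
    }

    // Deletion step 2: publish the new local start offset.
    void updateLocalLogStartOffset(long newStartOffset) {
        localLogStartOffset = newStartOffset;
    }

    Outcome fetch(long fetchOffset) {
        boolean inLocalSegment = fetchOffset < logEndOffset
                && localSegmentBaseOffsets.floor(fetchOffset) != null;
        if (inLocalSegment) {
            return Outcome.LOCAL_READ;
        }
        // Assumed, simplified form of the check described in the report:
        // below the local start offset means "read from remote storage".
        return fetchOffset < localLogStartOffset
                ? Outcome.REMOTE_READ
                : Outcome.OFFSET_OUT_OF_RANGE;
    }

    public static void main(String[] args) {
        long offset1 = 0L, offset2 = 50L, offset3 = 100L;
        StaleStartOffsetSketch log = new StaleStartOffsetSketch(200L, offset1, offset3);

        System.out.println(log.fetch(offset2)); // LOCAL_READ

        log.removeSegmentsBelow(offset3);       // segments gone, start offset still offset1
        System.out.println(log.fetch(offset2)); // OFFSET_OUT_OF_RANGE (the reported bug)

        log.updateLocalLogStartOffset(offset3); // start offset finally catches up
        System.out.println(log.fetch(offset2)); // REMOTE_READ
    }
}
```

The middle call is the window the report is about: the segment holding {{offset2}} is no longer in memory, yet the comparison still uses the stale start offset. In this model, making the two deletion steps appear atomic to the fetch path, or publishing the new start offset before removing segments, would close that window. Whether either matches the actual fix for this issue is not stated in the report.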