Adam Horky created KAFKA-20732:
----------------------------------
Summary: Data loss: tiered size-retention deletes in-retention
data from both tiers after leader election when highestOffsetInRemoteStorage is
unset (-1)
Key: KAFKA-20732
URL: https://issues.apache.org/jira/browse/KAFKA-20732
Project: Kafka
Issue Type: Bug
Components: Tiered-Storage
Affects Versions: 3.9.0, 4.3.0
Reporter: Adam Horky
Size-based remote retention deletes the oldest data when
{{onlyLocalLogSegmentsSize + remoteLogSizeBytes > retention.bytes}}.
{{onlyLocalLogSegmentsSize}} sums segments with {{baseOffset >=
highestOffsetInRemoteStorage()}}. That field is in-memory, initialised to
{{-1}}, not recovered on startup, and seeded only by the copy and follower
tasks.
After a leader election the copy and expiration tasks both start at {{-1}} and
are scheduled with {{initialDelay = 0}} on separate thread pools, so they race.
If the expiration task runs first, the filter {{baseOffset >= -1}} matches the
entire local log, and those segments are also counted in {{remoteLogSizeBytes}}
— the same offsets counted twice. The apparent size roughly doubles,
{{retention.bytes}} appears breached, and the broker deletes the oldest remote
segments and advances {{logStartOffset}}. {{logStartOffset}} replicates to
followers, so each replica deletes local segments below the new floor.
In-retention data is removed from both tiers on all replicas; consumers receive
{{OFFSET_OUT_OF_RANGE}} for the removed offsets.
The follower path already seeds this offset before it can cause harm:
{{RLMFollowerTask.execute()}} does so with the comment _"so that the local log
segments are not deleted before they are copied to remote storage."_ The
leader/expiration path, which runs size-retention, has no equivalent step.
h2. Example — one affected partition
A 24-partition test topic at its cap ({{retention.bytes = 134217728}} = 128
MiB, {{local.retention.bytes = -2}}, {{retention.ms = -1}}); both tiers held
the full window (remote ≈ local ≈ cap, copy-lag 0). One broker is restarted and
becomes the fresh leader of partition {{...-1}}; ~40 s later its expiration
task runs while {{highestOffsetInRemoteStorage}} is still {{-1}}:
{code}
# leader shared-sasl-kafka-1 — size-retention on a partition whose unique data
is ~128 MiB:
08:43:59.420 RLMExpirationTask ...repro.v1-1] About to delete remote log
segment ... due to
retention size 134217728 breach. Log size after deletion will be
282854919
# cap = 134217728 (128 MiB); computed ~270 MiB = local(~128) +
remote(~128), counted twice
# ... deletes 9 remote segments, recomputing downward until
just above the cap:
08:43:59.430 ... due to retention size 134217728 breach. Log size after
deletion will be 149820412
08:43:59.430 UnifiedLog ...repro.v1-1] Incremented log start offset to 289616
due to segment deletion
# the new log-start floor replicates to the followers:
08:43:59.680 shared-sasl-kafka-0 ...repro.v1-1] Incremented log start offset
to 289616 due to leader offset increment
08:44:36.319 shared-sasl-kafka-2 ...repro.v1-1] Incremented log start offset
to 289616 due to leader offset increment
# each replica then deletes its local segments below the floor — data now gone
from BOTH tiers:
08:44:21.146 UnifiedLog ...repro.v1-1] Deleting segments due to local log
start offset 289616 breach:
LogSegment(baseOffset=144975,...), ... ,(baseOffset=273618,...)
# 9 segments, then LocalLog "Deleting segment files"
{code}
{{logStartOffset}} jumped 144975 → 289616: ~144,641 records (~141 MiB) removed
from both tiers, even though the true unique size (~128 MiB) was within the cap.
h2. Reproductions
* *Staging (3.9.0, KRaft, RF=3, Aiven RSM):* a rolling restart over-deleted 6
of 12 partitions of a topic that sat at its {{retention.bytes}}; on each, the
new {{logStartOffset}} equalled {{highestCopiedRemoteOffset + 1}} (the whole
copied window expired).
* *Test cluster:* the topic above, rolling-restarted, over-deleted 13 of 24
partitions (example shown). A control topic with {{retention.bytes = -1}}
(size-retention disabled) under the identical restart was unaffected.
* *Unit test (trunk and 3.9.0):* two tests with identical remote metadata and
{{retention.bytes}}, differing only in {{onlyLocalLogSegmentsSize}} ({{0}} when
seeded → no deletion; whole-log when {{-1}} → both segments deleted). A
companion {{UnifiedLogTest}} confirms the real {{UnifiedLog}} returns the whole
local log at {{-1}}.
h2. Fix
Seeding {{highestOffsetInRemoteStorage}} from {{__remote_log_metadata}} before
size-retention (mirroring the follower path) makes the unit test pass and
removes the over-deletion on the test cluster (same fill + restart, zero
{{logStartOffset}} jumps).
* failing test (repro, no fix), diff vs trunk:
[https://github.com/apache/kafka/compare/trunk...horkyada:kafka:tiered-size-retention-unseeded-highest-offset-repro]
* proposed fix, diff vs trunk:
[https://github.com/apache/kafka/compare/trunk...horkyada:kafka:fix-seed-highest-remote-offset]
(Happy to open a PR against apache/kafka once I have an ICLA.)
h2. Scope
Size-retention only — {{retention.ms}} is unaffected (time-retention never
reads {{onlyLocalLogSegmentsSize}}/{{highestOffsetInRemoteStorage}}). The loss
requires a populated remote tier on a topic at/near its cap (with {{R ≈ 0}}
there is nothing to double-count). Magnitude scales with the local-tier size;
with the default {{local.retention.bytes = -2}} it is the whole window.
h2. Related issues (same area, different root cause — not a duplicate)
* *KAFKA-17212* — same method, but a {{>=}}-vs-{{>}} off-by-one double-counting
a _single_ segment while the offset is correct; {{>=}}→{{>}} does not fix this
(with {{-1}}, {{baseOffset > -1}} still matches everything). This is a
whole-log over-count.
* *KAFKA-16711* — same {{-1}} stale-offset class, different trigger (logDir
altering), and it _blocks_ cleanup rather than over-deleting.
* *KAFKA-20148* — same {{RLMExpirationTask}} data-loss class, but the trigger
is _disabling_ remote storage (cancel-vs-execute). This needs no disable.
(Happy to provide the source line-by-line walk-through and the full
reproduction setup if useful.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)