Hi Krutika and Amogh,

I'm not sure if handling this on the implementation level of a REST catalog
is the best solution. The remove-snapshot update implementation of
iceberg-go, iceberg-rust and iceberg-java all silently remove references to
snapshots. A catalog does not know if the received update was produced by
running ExpireSnapshots or if it is an explicit call, changing the behavior
to fail instead of cleaning up referenced branches means the REST catalog
diverges in behavior from other catalogs.

Assuming that ExpireSnapshots always produces RemoveSnapshotRef alongside
RemoveSnapshots updates, a boolean field could be added to RemoveSnapshots
which would fail instead of doing the silent cleanup of snapshot references.

Kind regards,
Tobias

Am Mi., 11. Feb. 2026 um 11:54 Uhr schrieb Krutika Dhananjay <
[email protected]>:

> Hi Amogh,
>
> Thanks for the response. You're right that ExpireSnapshots correctly
> identifies reachable snapshots at computation time. Java and Go
> implementations both do this. The race window I'm concerned about is a
> change of state between when ExpireSnapshots computes the candidate list
> and when Commit executes.
>
> Specifically:
> 1. ExpireSnapshots loads metadata M1, computes candidates [snap-1]
> (unreferenced in M1)
> 2. Client concurrently commits SetSnapshotRef("feature-branch", snap-1)
> (metadata is now M2)
> 3. Maintenance job's Commit loads fresh metadata M2 and applies
> RemoveSnapshots([snap-1]) against it
> 4. CAS succeeds because maintenance has the latest pointer
>
> The result is that both snap-1 and the newly added feature-branch ref are
> removed, since RemoveSnapshots also cleans up refs pointing to deleted
> snapshots.
>
> Having said that, I do think this is probably best handled at the catalog
> level rather than changing client library semantics or the spec. We'll
> handle it in our catalog's commit path by re-validating the candidate list
> against the freshly loaded metadata before applying the updates.
>
> Thanks,
> Krutika
>
> On Tue, Feb 10, 2026 at 12:03 AM Amogh Jahagirdar <[email protected]>
> wrote:
>
>> Hey Krutika,
>>
>> >1. A maintenance job identifies snap-1 as eligible for expiration.
>> > 2. A client commits SetSnapshotRef("feature-branch", snap-1).
>> > If the client commits first, the maintenance job loads the fresh
>> metadata (which now contains a ref for feature-branch -> snap-1), builds
>> its RemoveSnapshots([snap-1]) update, and commits. The CAS > succeeds
>> because the maintenance job has the latest metadata pointer. But the
>> resulting metadata is corrupt: feature-branch points to a deleted snapshot.
>>
>> This situation shouldn't happen. In the scenario described, after the
>> client commits, it has committed an addition of a reference to a snapshot.
>> The maintenance job performing snapshot removal, should not be removing
>> "snap-1" because it's still referenced (assuming retention is set correctly
>> etc). The Java reference implementation of expire snapshots
>> implementation does a reachability of what
>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L299>
>> snapshots are still referenced, and those will not be eligible for cleanup.
>> Here's the overall algorithm
>> <https://iceberg.apache.org/spec/#snapshot-retention-policy>.
>>
>> Let me know if I misunderstood the scenario or even better if there's a
>> test that can repro the issue, but I don't think we should need another
>> update type, the cleanup logic in snapshot expiration already should not be
>> considering valid referenced snapshots for removal.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Sun, Feb 8, 2026 at 10:14 PM Krutika Dhananjay <[email protected]>
>> wrote:
>>
>>> Adding Matt Topol to the discussion.
>>>
>>> Thanks,
>>> Krutika
>>>
>>> On Fri, Feb 6, 2026 at 3:55 PM Krutika Dhananjay <[email protected]>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'd like to bring up a race condition between snapshot expiration and
>>>> concurrent client commits that add refs to snapshots.
>>>>
>>>> *The Problem*
>>>> Consider a table with snap-1 (unreferenced) and snap-2 (main). Two
>>>> operations happen concurrently:
>>>>
>>>> 1. A maintenance job identifies snap-1 as eligible for expiration.
>>>> 2. A client commits SetSnapshotRef("feature-branch", snap-1).
>>>>
>>>> If the client commits first, the maintenance job loads the fresh
>>>> metadata (which now contains a ref for feature-branch -> snap-1), builds
>>>> its RemoveSnapshots([snap-1]) update, and commits. The CAS succeeds because
>>>> the maintenance job has the latest metadata pointer. But the resulting
>>>> metadata is corrupt: feature-branch points to a deleted snapshot.
>>>>
>>>> CAS-based optimistic concurrency protects against concurrent pointer
>>>> updates, but it does not validate the semantic correctness of the operation
>>>> against the current state. The maintenance job's commit is structurally
>>>> valid (latest pointer) but semantically invalid (removes a referenced
>>>> snapshot).
>>>>
>>>> *Proposed Fix in iceberg-go*
>>>> We addressed <https://github.com/apache/iceberg-go/pull/716> this in
>>>> iceberg-go by modifying MetadataBuilder.RemoveSnapshots() to skip removing
>>>> any snapshot that is currently referenced by a branch or tag. This works
>>>> correctly -- the validation happens against the same metadata state that
>>>> the CAS will verify hasn't changed.
>>>> However, this changes the existing behavior of RemoveSnapshots, which
>>>> currently removes the snapshot unconditionally and cleans up any refs that
>>>> pointed to it. This would be a deviation from the behaviour in iceberg-java
>>>> and iceberg-rs.
>>>>
>>>> *Proposed Alternative: AssertSnapshotNotReferenced*
>>>> A less invasive approach that preserves the existing RemoveSnapshots
>>>> behavior would be to introduce a new assertion update --
>>>> AssertSnapshotNotReferenced. This would be analogous to existing
>>>> assertions, but for validating that a snapshot has no refs pointing to it.
>>>>
>>>> During commit, a catalog-side maintenance job would include this
>>>> assertion alongside its RemoveSnapshots update:
>>>>
>>>> updates: [
>>>>       AssertSnapshotNotReferenced(snap-1),
>>>>       RemoveSnapshots([snap-1])
>>>> ]
>>>>
>>>> The assertion would be validated against the freshly loaded metadata
>>>> during commit. If a concurrent client had linked a ref to snap-1 in the
>>>> meantime, the assertion fails, the CAS-based commit is never attempted, and
>>>> the maintenance job can retry with updated knowledge of the table state.
>>>>
>>>> This approach:
>>>> - Does not change existing RemoveSnapshots semantics
>>>> - Can be implemented uniformly across Java, Rust, and Go
>>>>
>>>> Would love to hear thoughts on the preferred way to handle this.
>>>>
>>>> Thanks,
>>>> Krutika
>>>>
>>>

Reply via email to