[I] Relax OCC for REST updateTable to tolerate compatible APPENDs and partition-disjoint OVERWRITEs; optional server-side APPEND aggregation [iceberg]

via GitHub Wed, 19 Nov 2025 03:25:18 -0800


zmk-wawa opened a new issue, #14627:
URL: https://github.com/apache/iceberg/issues/14627

### Feature Request / Improvement

https://github.com/apache/iceberg/blob/4ee507d5788e31c74d5ef77204ef126ae0105981/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L453-L496

https://github.com/apache/iceberg/blob/4ee507d5788e31c74d5ef77204ef126ae0105981/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L386-L409

When updating Iceberg tables via REST, updateTable relies on optimistic
concurrency control (OCC), which causes in-flight requests to fail when the
table snapshot drifts from the request’s base snapshot. This behavior is overly
strict for two common situations:
1. APPEND operations: Concurrent appends that **do not change the table
structure** are **semantically commutative**. However, the current commit path
treats any snapshot drift as a hard conflict, leading to frequent retries and
failures under high concurrency.
2. Partition-scoped OVERWRITE operations: If concurrent **changes involve
partitions that are disjoint from the overwrite’s target partitions,** the two
operations are semantically independent. Nevertheless, the current
snapshot-level check still rejects the overwrite.
The end result in write-intensive, multi-writer deployments is excessive
**retry traffic, high latency, and even task failures** after many retries.

Expected changes:
1. APPEND: If the snapshot drift since the request’s base contains only
compatible changes (e.g., other appends; no structural metadata changes), the
server should rebase the request onto the latest snapshot and commit, rather
than failing immediately.
2. OVERWRITE: If the drifted concurrent changes affect a set of partitions
that are disjoint from the overwrite’s target partitions, the server should
rebase onto the latest snapshot and commit (this should not be pushed back to
the client to re-issue the update).
These relaxations preserve correctness because the final table state is the
same as the one produced by some serial ordering of the same operations.
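
To make the two rules above concrete, here is a minimal sketch of the
server-side decision. It is not existing Iceberg code: `DriftCheck`, `Change`,
`Op`, and `canRebase` are hypothetical names, partitions are modeled as plain
strings, and the APPEND rule is deliberately conservative (it tolerates only
other appends in the drift).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the proposed drift check; none of these types exist in Iceberg. */
final class DriftCheck {
  enum Op { APPEND, OVERWRITE, SCHEMA_CHANGE }

  /** One committed or pending change, reduced to what the check needs. */
  static final class Change {
    final Op op;
    final Set<String> partitions; // partitions the change writes to or replaces

    Change(Op op, String... partitions) {
      this.op = op;
      this.partitions = new HashSet<>(Arrays.asList(partitions));
    }
  }

  /**
   * Returns true if the request may be rebased onto the latest snapshot.
   * drift holds the changes committed since the request's base snapshot.
   */
  static boolean canRebase(Change request, List<Change> drift) {
    for (Change committed : drift) {
      if (committed.op == Op.SCHEMA_CHANGE) {
        return false; // structural change: keep the strict OCC failure
      }
      if (request.op == Op.APPEND && committed.op != Op.APPEND) {
        return false; // in this sketch, appends only tolerate other appends
      }
      if (request.op == Op.OVERWRITE
          && !Collections.disjoint(request.partitions, committed.partitions)) {
        return false; // a concurrent change touched a partition being overwritten
      }
    }
    // Only APPEND and OVERWRITE requests are candidates for a server-side rebase.
    return request.op == Op.APPEND || request.op == Op.OVERWRITE;
  }
}
```

With this model, an overwrite of `day=1` could be rebased over a concurrent
append to `day=2`, but not over a concurrent append to `day=1`.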

Possible implementation:
1. For APPEND requests: after detecting snapshot drift and confirming that it
contains **no structural changes**, refresh to the latest snapshot and re-apply
the append on top of it before committing.
2. For OVERWRITE requests: first **determine the overwrite’s target
partition set** (e.g., via its filter or known partition keys). Then check
which partitions have been affected by drifted concurrent changes since the
base snapshot. If these partition sets are disjoint, **refresh to the latest
snapshot** and re-apply the overwrite on top of it before committing.
3. Provide a mechanism that, under **high APPEND concurrency**, allows each
table to use a short-lived server-side queue to **aggregate or serialize APPEND
operations** to reduce retries. This should be opt-in, with moderate parameters
(small maximum wait time / batch size) to preserve overall concurrency;
parameters could even adapt based on historical load.
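
For item 3, one possible shape of the opt-in queue is sketched below. This is
only an illustration under assumptions: `AppendBatcher`, `submit`, and
`nextBatch` are hypothetical names, data files are modeled as strings, and the
actual commit of a batch is left out. A single committer thread per table
would call `nextBatch()` in a loop and commit each non-empty batch as one
append.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-table batcher for pending APPENDs; not an existing Iceberg class. */
final class AppendBatcher {
  private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
  private final int maxBatch;       // upper bound on files per aggregated commit
  private final long maxWaitMillis; // upper bound on how long to wait for more files

  AppendBatcher(int maxBatch, long maxWaitMillis) {
    this.maxBatch = maxBatch;
    this.maxWaitMillis = maxWaitMillis;
  }

  /** Called by request handlers: enqueue one data file from an APPEND request. */
  void submit(String dataFile) {
    pending.add(dataFile);
  }

  /** Collects up to maxBatch files, waiting at most maxWaitMillis for new arrivals. */
  List<String> nextBatch() {
    List<String> batch = new ArrayList<>();
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
    try {
      while (batch.size() < maxBatch) {
        long remaining = Math.max(0L, deadline - System.nanoTime());
        String file = pending.poll(remaining, TimeUnit.NANOSECONDS);
        if (file == null) {
          break; // nothing arrived before the deadline
        }
        batch.add(file);
        // Take whatever else is already queued, without exceeding the batch size.
        pending.drainTo(batch, maxBatch - batch.size());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt; return what we have
    }
    return batch;
  }
}
```

Keeping `maxBatch` and `maxWaitMillis` small bounds the extra latency a single
append can see, which matches the intent of making the queue short-lived.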

### Query engine

Spark

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
