zmk-wawa opened a new issue, #14627: URL: https://github.com/apache/iceberg/issues/14627
### Feature Request / Improvement

https://github.com/apache/iceberg/blob/4ee507d5788e31c74d5ef77204ef126ae0105981/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L453-L496

https://github.com/apache/iceberg/blob/4ee507d5788e31c74d5ef77204ef126ae0105981/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L386-L409

When updating Iceberg tables via REST, `updateTable` relies on optimistic concurrency control (OCC), so an in-flight request fails whenever the table's current snapshot has drifted from the request’s base snapshot. This behavior is overly strict in two common situations:

1. APPEND operations: Concurrent appends that **do not change the table structure** are **semantically commutative**. However, the current commit path treats any snapshot drift as a hard conflict, leading to frequent retries and failures under high concurrency.
2. Partition-scoped OVERWRITE operations: If concurrent **changes involve partitions that are disjoint from the overwrite’s target partitions**, the two operations are semantically independent. Nevertheless, the current snapshot-level check still rejects the overwrite.

In write-intensive, multi-writer deployments the end result is excessive **retry traffic, high latency, and even task failures** once retries are exhausted.

Expected changes:

1. APPEND: If the snapshot drift since the request’s base contains only compatible changes (e.g., other appends; no structural metadata changes), the server should rebase the request onto the latest snapshot and commit, rather than failing immediately.
2. OVERWRITE: If the drifted concurrent changes affect a set of partitions that is disjoint from the overwrite’s target partitions, the server should rebase onto the latest snapshot and commit, instead of pushing the conflict back to the client to re-issue the update.

These relaxations preserve correctness because the resulting table state is the same as the one produced by some serial ordering of the same operations.

Possible implementation:

1. For APPEND requests: after detecting snapshot drift and confirming that the drift contains **no structural changes**, refresh to the latest snapshot and re-apply the append on top of it before committing.
2. For OVERWRITE requests: first **determine the overwrite’s target partition set** (e.g., via its filter or known partition keys). Then determine which partitions have been affected by concurrent changes since the base snapshot. If the two partition sets are disjoint, **refresh to the latest snapshot** and re-apply the overwrite on top of it before committing.
3. Provide a mechanism that, under **high APPEND concurrency**, lets each table use a short-lived server-side queue to **aggregate or serialize APPEND operations** and so reduce retries. This should be opt-in, with moderate parameters (small maximum wait time / batch size) to preserve overall concurrency; the parameters could even adapt based on historical load.

### Query engine

Spark

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
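To make the proposed checks in implementation items 1 and 2 concrete, here is a minimal sketch of the two compatibility decisions. It is not Iceberg code: `RebaseCheck`, `Op`, `DriftedCommit`, and both method names are hypothetical, partitions are modeled as plain strings, and the sketch assumes the server can already classify each commit that landed after the base snapshot and list the partitions it touched. For APPEND it conservatively treats only other appends as compatible drift.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Iceberg.
class RebaseCheck {

  // Simplified classification of a commit that landed after the base snapshot.
  enum Op { APPEND, OVERWRITE, STRUCTURAL_CHANGE }

  // One drifted commit: what it did and which partitions it touched.
  static final class DriftedCommit {
    final Op op;
    final Set<String> partitions;

    DriftedCommit(Op op, Set<String> partitions) {
      this.op = op;
      this.partitions = partitions;
    }
  }

  // An APPEND may be rebased when every drifted commit is itself an append
  // (assumption: appends are the only change treated as compatible here).
  static boolean canRebaseAppend(List<DriftedCommit> drift) {
    return drift.stream().allMatch(c -> c.op == Op.APPEND);
  }

  // An OVERWRITE may be rebased when no drifted commit is structural and
  // none touches a partition in the overwrite's target set.
  static boolean canRebaseOverwrite(Set<String> targets, List<DriftedCommit> drift) {
    return drift.stream().allMatch(
        c -> c.op != Op.STRUCTURAL_CHANGE && Collections.disjoint(targets, c.partitions));
  }
}
```

If either method returns `false`, the server would fall back to today's behavior and report the conflict to the client. When the drift list is empty both methods return `true`, which corresponds to the ordinary no-conflict commit.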
