agnes-xinyi-lu opened a new issue, #15001:
URL: https://github.com/apache/iceberg/issues/15001
### Apache Iceberg version
1.6.1
### Query engine
Spark
### Please describe the bug 🐞
### Problem
When multiple processes commit to different branches(or writing to
different waps) of the same Iceberg table concurrently through the REST
catalog, some commits fail with a non-retryable ValidationException when
building TableMetadata on the server side and calling
[addSnapshot](https://github.com/apache/iceberg/blob/60b42ec0550bcb31c1a05000f34b5ca24016221a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L54):
` Cannot add snapshot with sequence number X older than last sequence
number X`
Instead of CommitFailedException, this error is non-retryable , bypassing
automatic retries and wasting compute resources.
Repro in [unit
test.](https://github.com/agnes-xinyi-lu/iceberg/blob/bf1655361a918e1b35d560deaa00859d0c861e79/core/src/test/java/org/apache/iceberg/rest/TestRestCatalogConcurrentWrites.java)
### Root Cause
1. All the snapshots share a global sequence number counter at the table
level, but we don't add extra requirements for such addSnapshot to guarantee
snapshotId>last sequence number.
3. When a commit reaches TableMetadata.addSnapshot(), it fails validation
because another concurrent commit to a different branch already incremented the
global sequence number
4. This validation failure occurs after the requirement checks (because
there is no check) pass, so it's thrown as ValidationException rather than
CommitFailedException
### Relevant work
Previously in OSS there was [similar issue
](https://github.com/apache/iceberg/issues/8390#issuecomment-1703950561)with
replace table, which was fixed/mitigated by checking if the snapshot has a
parent. But in this case it's a normal table update, and we probably don't want
to bypass the check because we want to maintain the order of all snapshots
through the global sequence number.
### Proposed Solution:
**(Note that we need to update the Rest Spec for the requirement)**
Added a new AssertLastSequenceNumber update requirement that validates
sequence number conflicts before the commit is applied.
Will attach the PR in comments.
Behavior After Fix
- Sequence number conflicts are caught early by AssertLastSequenceNumber
requirement
- Conflicts throw CommitFailedException which triggers automatic
client-side retries
- Concurrent commits to different branches eventually succeed through the
retry mechanism
### Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]