agnes-xinyi-lu opened a new issue, #15001:
URL: https://github.com/apache/iceberg/issues/15001

   ### Apache Iceberg version
   
   1.6.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
     ### Problem
   
     When multiple processes commit to different branches(or writing to 
different waps) of the same Iceberg table concurrently through the REST 
catalog, some commits fail with a non-retryable ValidationException when 
building TableMetadata on the server side and calling 
[addSnapshot](https://github.com/apache/iceberg/blob/60b42ec0550bcb31c1a05000f34b5ca24016221a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L54):
    ` Cannot add snapshot with sequence number X older than last sequence 
number X`
   
     Instead of CommitFailedException, this error is non-retryable ,  bypassing 
automatic retries and wasting compute resources.
   Repro in [unit 
test.](https://github.com/agnes-xinyi-lu/iceberg/blob/bf1655361a918e1b35d560deaa00859d0c861e79/core/src/test/java/org/apache/iceberg/rest/TestRestCatalogConcurrentWrites.java)
 
   
    ### Root Cause
     1. All the snapshots share a global sequence number counter at the table 
level, but we don't add extra requirements for such addSnapshot to guarantee 
snapshotId>last sequence number. 
     3. When a commit reaches TableMetadata.addSnapshot(), it fails validation 
because another concurrent commit to a different branch already incremented the 
global sequence number
     4. This validation failure occurs after the requirement checks (because 
there is no check) pass, so it's thrown as ValidationException rather than 
CommitFailedException
   
   ### Relevant work
   Previously in OSS there was [similar issue 
](https://github.com/apache/iceberg/issues/8390#issuecomment-1703950561)with 
replace table, which was fixed/mitigated by checking if the snapshot has a 
parent. But in this case it's a normal table update, and we probably don't want 
to bypass the check because we want to maintain the order of all snapshots 
through the global sequence number. 
   
    ### Proposed Solution:
   **(Note that we need to update the Rest Spec for the requirement)**
     Added a new AssertLastSequenceNumber update requirement that validates 
sequence number conflicts before the commit is applied. 
   Will attach the PR in comments.
   
     Behavior After Fix
   
     - Sequence number conflicts are caught early by AssertLastSequenceNumber 
requirement
     - Conflicts throw CommitFailedException which triggers automatic 
client-side retries
     - Concurrent commits to different branches eventually succeed through the 
retry mechanism
    
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to