SamWheating opened a new issue, #14953:
URL: https://github.com/apache/iceberg/issues/14953

   ### Apache Iceberg version
   
   1.10.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I have been experimenting with various write-audit-publish workflows and I 
noticed this potentially hazardous behaviour when using `spark.wap.id` to stage 
changes.  
   
   If multiple snapshots are created with the same `wap.id`, then the 
`publish_changes` procedure will only cherry-pick the earliest matching 
snapshot, silently ignoring any other staged changes.
   
   Reproduction, using the spark-sql quickstart:
   ```sql
   CREATE TABLE demo.default.wap_example
       (country string, population bigint)
   USING ICEBERG
   PARTITIONED BY (country)
   TBLPROPERTIES ('write.wap.enabled'='true');
   
   SET spark.wap.id=wap_with_two_snapshots;
   
   -- write two rows into two partitions (will create two snapshots with the
   -- same wap.id)
   INSERT INTO demo.default.wap_example VALUES ('Canada', 40000000);
   INSERT INTO demo.default.wap_example VALUES ('USA', 340000000);
   
   CALL demo.system.publish_changes('demo.default.wap_example', 
'wap_with_two_snapshots');
   
   SELECT * FROM demo.default.wap_example;
   
   -- Canada    40000000
   -- Time taken: 0.052 seconds, Fetched 1 row(s)
   ```
   
   This makes sense looking at the procedure's definition 
[here](https://github.com/apache/iceberg/blob/00046005889c66ffc860bae012d7fc560e8f040a/spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/procedures/PublishChangesProcedure.java#L102-L105),
 but I think that this is potentially harmful as it can result in writes being 
silently lost.
   
   Based on the [validation performed when cherry-picking 
snapshots](https://github.com/apache/iceberg/blob/00046005889c66ffc860bae012d7fc560e8f040a/core/src/main/java/org/apache/iceberg/util/WapUtil.java#L42-L60),
 it looks like it's expected that `wap.id` will be unique among snapshots. In 
this case I think we should raise an error during the `publish_changes` 
procedure if there are multiple matching snapshots to prevent any ambiguity.
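
To make the proposal concrete, here is a minimal sketch of the check in plain Python. It is not the Iceberg API: `Snapshot` and `find_wap_snapshot` are hypothetical names used only for illustration, the error messages are placeholders, and the real fix would live in the Java procedure. The only detail taken from Iceberg is that the staged ID is stored under the `wap.id` key of the snapshot summary.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical stand-in for an Iceberg snapshot: an id plus its summary map."""
    snapshot_id: int
    summary: dict = field(default_factory=dict)


def find_wap_snapshot(snapshots, wap_id):
    """Return the single snapshot staged under wap_id.

    Raises ValueError if no snapshot matches, or if several do, instead of
    silently picking the first match.
    """
    matches = [s for s in snapshots if s.summary.get("wap.id") == wap_id]
    if not matches:
        raise ValueError(f"No snapshot found for WAP ID '{wap_id}'")
    if len(matches) > 1:
        ids = ", ".join(str(s.snapshot_id) for s in matches)
        raise ValueError(
            f"Ambiguous WAP ID '{wap_id}': multiple snapshots match ({ids})"
        )
    return matches[0]
```

With the two inserts from the reproduction above, both snapshots would match `wap_with_two_snapshots`, so this check would fail loudly rather than publishing only the Canada row.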
   
   I'd be happy to work on a fix, but I want to ensure that my understanding is 
correct here.
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

