hankfanchiu opened a new pull request #3043:
URL: https://github.com/apache/iceberg/pull/3043


   # Summary
   
   Partially revert e4df91e87007c2185453a896d5ff9a57b2a9b0c6 (from #2960) and 
allow a no-op partition replacement operation to be committed.
   
   # Motivation
   
   #2895 reported an exception when running an insert overwrite with an 
empty dataset from Spark.
   
   #2960 addressed the issue above by skipping the commit operation entirely 
(in both Spark 2 and Spark 3).
   
   However, we need to be able to differentiate between a no-op commit and 
no commit attempt at all.
   
   Concretely, we have scheduled Spark pipelines that use Iceberg metadata to 
track commits and read targeted Iceberg snapshots. We also set a 
`snapshot-property.<custom key>` write option to externally "name" each 
snapshot.
   
   With #2960, when an upstream Spark application skips a commit, the 
downstream Spark application fails to find and read the expected Iceberg 
snapshot by its custom snapshot property.
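
To illustrate the downstream lookup described above, here is a minimal sketch that searches a table metadata document for a snapshot by a custom summary entry. It is not our actual pipeline code: the helper name `find_snapshot_by_property`, the `pipeline-run` key, and the sample metadata are hypothetical, and the sketch only assumes that the custom property appears under the snapshot `summary`.

```python
import json
from typing import Optional


def find_snapshot_by_property(metadata: dict, key: str, value: str) -> Optional[dict]:
    """Return the newest snapshot whose summary has key == value, or None."""
    matches = [
        snapshot
        for snapshot in metadata.get("snapshots", [])
        if snapshot.get("summary", {}).get(key) == value
    ]
    # If several snapshots carry the same name, prefer the most recent one.
    return max(matches, key=lambda s: s["timestamp-ms"], default=None)


# Hypothetical metadata: "pipeline-run" stands in for the custom key.
metadata = json.loads("""
{
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1000,
     "summary": {"operation": "overwrite", "pipeline-run": "run-1"}},
    {"snapshot-id": 2, "timestamp-ms": 2000,
     "summary": {"operation": "overwrite", "pipeline-run": "run-2"}}
  ]
}
""")

found = find_snapshot_by_property(metadata, "pipeline-run", "run-2")
missing = find_snapshot_by_property(metadata, "pipeline-run", "run-3")
```

When the upstream commit is skipped, no snapshot carries the expected name, so the lookup returns `None` (as for `run-3` here) and the downstream reader cannot tell an empty result from a run that never happened.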
   
   # Testing
   
   The test case introduced by #2960 still passes:
   
   
https://github.com/apache/iceberg/blob/7d6f692937a939ffccb8fa997a91bd49f616eab6/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java#L192-L233
   
   On Spark 2, I've also run an application that saves an empty `Dataset` in 
overwrite mode, resulting in a new but no-op snapshot:
   
   ```json
     "snapshots" : [ {
       "snapshot-id" : 1680973636538102330,
       "timestamp-ms" : 1630102232337,
       "summary" : {
         "operation" : "overwrite",
         "spark.app.id" : "<omitted>",
         "replace-partitions" : "true",
         "<custom key>" : "<omitted>",
         "changed-partition-count" : "0",
         "total-records" : "0",
         "total-files-size" : "0",
         "total-data-files" : "0",
         "total-delete-files" : "0",
         "total-position-deletes" : "0",
         "total-equality-deletes" : "0"
       },
       "manifest-list" : "<omitted>.avro",
       "schema-id" : 0
     } ],
   ```
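
A consumer that wants to distinguish a no-op snapshot like the one above from a snapshot that changed data could inspect the summary counters. The following is a rough heuristic sketch, not part of this change: `is_noop_snapshot` is a hypothetical helper, and it assumes that absent `added-data-files` and `deleted-data-files` counters mean zero.

```python
def is_noop_snapshot(snapshot: dict) -> bool:
    """Heuristic: a snapshot is a no-op if it changed no partitions and no files."""
    summary = snapshot.get("summary", {})
    # Require the partition counter to be present and zero.
    if summary.get("changed-partition-count") != "0":
        return False
    # Assumption: file counters that are absent from the summary are zero.
    return all(
        int(summary.get(name, "0")) == 0
        for name in ("added-data-files", "deleted-data-files")
    )


# Summary values copied from the snapshot shown above (custom key omitted).
noop = {
    "snapshot-id": 1680973636538102330,
    "summary": {
        "operation": "overwrite",
        "replace-partitions": "true",
        "changed-partition-count": "0",
        "total-records": "0",
        "total-data-files": "0",
    },
}

# Hypothetical snapshot that replaced one partition with one new file.
changed = {
    "snapshot-id": 42,
    "summary": {
        "operation": "overwrite",
        "changed-partition-count": "1",
        "added-data-files": "1",
    },
}

noop_result = is_noop_snapshot(noop)
changed_result = is_noop_snapshot(changed)
```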


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


