hankfanchiu opened a new pull request #3043: URL: https://github.com/apache/iceberg/pull/3043
# Summary

Partially revert e4df91e87007c2185453a896d5ff9a57b2a9b0c6 (from #2960) and allow a no-op partition replacement operation to be committed.

# Motivation

#2895 encountered an exception when attempting an insert overwrite with an empty dataset from Spark.

#2960 addressed the issue above by skipping the commit operation entirely (in both Spark 2 and Spark 3). However, we need to be able to differentiate between a no-op commit and the absence of any commit attempt.

Concretely, we have scheduled Spark pipelines that use Iceberg metadata to track commits and read targeted Iceberg snapshots. We additionally set some `snapshot-property.<custom key>` to externally "name" each snapshot. With #2960, an upstream Spark application that skips a commit causes the downstream Spark application to fail to find and read the expected Iceberg snapshot by the custom snapshot property.

# Testing

The test case introduced by #2960 still passes: https://github.com/apache/iceberg/blob/7d6f692937a939ffccb8fa997a91bd49f616eab6/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java#L192-L233

On Spark 2, I've also run an application that saves an empty `Dataset` in overwrite mode, resulting in a new but no-op snapshot:

```json
"snapshots" : [ {
  "snapshot-id" : 1680973636538102330,
  "timestamp-ms" : 1630102232337,
  "summary" : {
    "operation" : "overwrite",
    "spark.app.id" : "<omitted>",
    "replace-partitions" : "true",
    "<custom key>" : "<omitted>",
    "changed-partition-count" : "0",
    "total-records" : "0",
    "total-files-size" : "0",
    "total-data-files" : "0",
    "total-delete-files" : "0",
    "total-position-deletes" : "0",
    "total-equality-deletes" : "0"
  },
  "manifest-list" : "<omitted>.avro",
  "schema-id" : 0
} ],
```
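To illustrate why the downstream application depends on the no-op snapshot existing, here is a minimal sketch of the lookup described under Motivation. It is not the Iceberg API: it works on plain dictionaries shaped like the `snapshots` JSON shown in this description. The function name `find_snapshot_by_property`, the key `pipeline-run`, and its value are hypothetical stand-ins for our real `<custom key>` and are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def find_snapshot_by_property(
    snapshots: List[Dict], key: str, value: str
) -> Optional[Dict]:
    """Return the newest snapshot whose summary maps `key` to `value`, or None."""
    matches = [s for s in snapshots if s.get("summary", {}).get(key) == value]
    # Pick the most recent match; default=None covers the "no match" case.
    return max(matches, key=lambda s: s["timestamp-ms"], default=None)


# A no-op snapshot shaped like the one in this PR description.
# "pipeline-run" is a hypothetical stand-in for the omitted "<custom key>".
noop_snapshot = {
    "snapshot-id": 1680973636538102330,
    "timestamp-ms": 1630102232337,
    "summary": {
        "operation": "overwrite",
        "replace-partitions": "true",
        "pipeline-run": "2021-08-27",
        "changed-partition-count": "0",
        "total-records": "0",
    },
}

# With this PR: the empty overwrite is committed, so the snapshot is listed.
snapshots_with_noop_commit = [noop_snapshot]
# With #2960 alone: the commit is skipped, so nothing is listed for this run.
snapshots_with_skipped_commit = []

found = find_snapshot_by_property(
    snapshots_with_noop_commit, "pipeline-run", "2021-08-27"
)
missing = find_snapshot_by_property(
    snapshots_with_skipped_commit, "pipeline-run", "2021-08-27"
)
```

In this sketch `found` is the no-op snapshot while `missing` is `None`, which is the condition that makes our downstream application fail when the upstream commit is skipped.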
