dramaticlly commented on code in PR #14287:
URL: https://github.com/apache/iceberg/pull/14287#discussion_r2418060119


##########
api/src/main/java/org/apache/iceberg/ExpireSnapshots.java:
##########
@@ -119,6 +119,17 @@ public interface ExpireSnapshots extends PendingUpdate<List<Snapshot>> {
    */
   ExpireSnapshots cleanExpiredFiles(boolean clean);
 
+  /**
+   * Skip the cleanup of orphaned data files as part of snapshot expiration
+   *
+   * @param retain true to retain orphaned data files only reachable by expired snapshots
+   * @return this for method chaining
+   */
+  default ExpireSnapshots retainOrphanedDataFiles(boolean retain) {

Review Comment:
   Thanks @amogh-jahagirdar! We actually explored that option, and here is what 
we found:
   1. The `retainOrphanedDataFiles` option speeds up the cleanup process by 
avoiding opening and reading the manifest files. If only metadata files (such 
as manifest lists and manifests) are considered for cleanup, we can skip 
reading the manifests, which is usually the bottleneck and requires work 
distribution. That distribution is usually handled in Spark actions and 
procedures.
   2. The `deleteWith` consumer currently receives only the file path, as a 
String. We could use the file suffix to differentiate metadata files from data 
files, but with the introduction of #13769 we can no longer rely on `.parquet` 
alone to tell them apart. We could probably still check for 
`$tablePath/data/` in the file path, but that is only a convention.
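
For reference, the path-based workaround from point 2 could look roughly like the sketch below. It is plain Java with no Iceberg dependency. The class name `DataPathFilter`, its helper methods, and the example bucket paths are hypothetical, and the check assumes data files live under `$tablePath/data/`, which is only a convention and not something a table guarantees.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Iceberg: filters delete callbacks by path.
class DataPathFilter {

  // Returns true when the path sits under <tableLocation>/data/.
  // ASSUMPTION: data files follow the conventional layout; custom
  // data locations would defeat this check.
  static boolean isUnderDataDir(String tableLocation, String path) {
    String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
    return path.startsWith(base + "data/");
  }

  // Wraps a delete function so that paths under the data directory are skipped
  // and everything else (for example manifests and manifest lists) is deleted.
  static Consumer<String> skippingDataFiles(String tableLocation, Consumer<String> delete) {
    return path -> {
      if (!isUnderDataDir(tableLocation, path)) {
        delete.accept(path);
      }
    };
  }

  public static void main(String[] args) {
    List<String> deleted = new ArrayList<>();
    Consumer<String> consumer = skippingDataFiles("s3://bucket/db/tbl", deleted::add);

    consumer.accept("s3://bucket/db/tbl/metadata/snap-1.avro"); // forwarded to delete
    consumer.accept("s3://bucket/db/tbl/data/part-0.parquet");  // skipped

    System.out.println(deleted); // prints [s3://bucket/db/tbl/metadata/snap-1.avro]
  }
}
```

A consumer like this could be handed to `deleteWith`, but it only filters what gets deleted. The manifests would still be opened and read to find the candidate files, so it does not give the speedup described in point 1.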



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

