rdblue commented on code in PR #4503:
URL: https://github.com/apache/iceberg/pull/4503#discussion_r843300416


##########
api/src/main/java/org/apache/iceberg/actions/DeleteOrphanFiles.java:
##########
@@ -80,6 +80,19 @@
    */
   DeleteOrphanFiles executeDeleteWith(ExecutorService executorService);
 
+  /**
+   * Passes a table which contains the list of actual files in the table. This skips the directory listing - any
+   * files in the actualFilesTable provided which are not found in table metadata will be deleted. Not compatible
+   * with `location` or `older_than` arguments - this assumes that the provided table of actual files has been
+   * filtered down to the table’s location and only includes files older than a reasonable retention interval.
+   *
+   * @param tableName the table containing the actual files dataset.  Should have a single `file_path` string column
+   * @return this for method chaining
+   */
+  default DeleteOrphanFiles actualFilesTable(String tableName) {

Review Comment:
   Using a temporary table is a good idea for the stored procedure, but for
   the action API I would expect to pass in a `Dataset` of some kind.
   `Dataset<String>` would work, and we could also expose a couple of Java
   classes to make `Dataset<FileLocationAndModifiedTime>` and
   `Dataset<FileLocation>` work.
   
   I think we can also make what's happening here clearer in the API by naming
   the method something like `compareToFileList(Dataset<String>)`:
   
   ```java
   Dataset<String> files = ...
   SparkActions.get()
       .deleteOrphanFiles(catalog.loadTable(identifier))
       .compareToFileList(files)
       .execute();
   ```
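   
   As a rough illustration of the helper classes mentioned above, here is a
   minimal sketch in plain Java. Everything in it is hypothetical:
   `FileListSketch`, the bean field names, and `findOrphans` are assumptions
   made for this example, not existing Iceberg APIs. It assumes the modified
   time would be used to apply the retention filter, and it uses in-memory
   collections in place of a `Dataset` so the comparison logic is easy to see.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Set;

   // Hypothetical sketch only: none of these names exist in Iceberg.
   class FileListSketch {

     // Hypothetical bean with a single path field (getter/setter plus no-arg
     // constructor, the usual shape for a bean-style row class).
     public static class FileLocation {
       private String filePath;

       public FileLocation() {}

       public FileLocation(String filePath) {
         this.filePath = filePath;
       }

       public String getFilePath() {
         return filePath;
       }

       public void setFilePath(String filePath) {
         this.filePath = filePath;
       }
     }

     // Hypothetical bean that also carries the last-modified time, so that a
     // retention cutoff could be applied (an assumption of this sketch).
     public static class FileLocationAndModifiedTime extends FileLocation {
       private long lastModifiedMillis;

       public FileLocationAndModifiedTime() {}

       public FileLocationAndModifiedTime(String filePath, long lastModifiedMillis) {
         super(filePath);
         this.lastModifiedMillis = lastModifiedMillis;
       }

       public long getLastModifiedMillis() {
         return lastModifiedMillis;
       }

       public void setLastModifiedMillis(long lastModifiedMillis) {
         this.lastModifiedMillis = lastModifiedMillis;
       }
     }

     // Returns paths from the supplied file list that are older than the cutoff
     // and are not referenced by table metadata.
     static List<String> findOrphans(
         List<FileLocationAndModifiedTime> actualFiles,
         Set<String> referencedPaths,
         long olderThanMillis) {
       List<String> orphans = new ArrayList<>();
       for (FileLocationAndModifiedTime file : actualFiles) {
         boolean oldEnough = file.getLastModifiedMillis() < olderThanMillis;
         boolean referenced = referencedPaths.contains(file.getFilePath());
         if (oldEnough && !referenced) {
           orphans.add(file.getFilePath());
         }
       }
       return orphans;
     }
   }
   ```
   
   With only `FileLocation` (or `Dataset<String>`), the caller would have to
   pre-filter by age, which matches the assumption stated in the Javadoc above.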



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

