jackye1995 opened a new issue #3212:
URL: https://github.com/apache/iceberg/issues/3212


   `DeleteOrphanFiles` currently has the following limitations:
   1. It assumes a single location for all files, which does not work for 
tables that store files in separate directories.
   2. It does not work if multiple tables store files under the same root, 
which is likely a common case for users who enable object storage mode and 
store all data files at the root location.
   3. It still requires a file system to perform file listing, so `S3FileIO` 
users still have to install the S3A file system just to remove orphan files.
   
   My current thoughts are the following:
   
   For 1, this can be mitigated by running the action multiple times with 
different root locations, which is probably sufficient for a maintenance 
procedure.
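
A minimal sketch of the multi-root idea, assuming file listing is abstracted 
behind a function so that no particular file system is required. The class 
`MultiRootOrphanScan`, the method `findOrphans`, and the `listFiles` parameter 
are hypothetical names for illustration and are not part of the Iceberg API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper (not an Iceberg class): runs one listing pass per root
// location and collects files that the table metadata does not reference.
final class MultiRootOrphanScan {
  private MultiRootOrphanScan() {}

  static List<String> findOrphans(
      List<String> roots,
      Function<String, List<String>> listFiles, // assumed listing callback
      Set<String> referencedFiles) {
    List<String> orphans = new ArrayList<>();
    for (String root : roots) {
      for (String file : listFiles.apply(root)) {
        // A file is an orphan candidate only if no metadata references it.
        if (!referencedFiles.contains(file)) {
          orphans.add(file);
        }
      }
    }
    return orphans;
  }
}
```

A real maintenance job would presumably also apply an age threshold before 
deleting anything, so that files from in-progress writes are not removed.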
   
   For 2, maybe we can add a few filters and options, including:
   1. Path filter: accepts a regex that identifies paths that might belong to 
some other table and should not be removed.
   2. Metadata filter: checks specific metadata of a file to determine whether 
it might belong to some other table. This also requires adding a new 
`Map<String, String> metadata()` method to `InputFile` for `FileIO` 
implementations to support.
   3. We also have a hard-coded restriction that the driver lists at most 3 
levels and only dirs that have fewer than 10 direct sub dirs, which I think 
should be made configurable.
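
The filters above could look roughly like the following sketch. Everything 
here is an assumption for illustration: `OrphanFileFilters`, `ListingLimits`, 
the `table-uuid` style metadata key used in the usage note, and the convention 
that a predicate returns `true` when a file may be removed. None of these 
exist in Iceberg as written:

```java
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.regex.Pattern;

// Hypothetical filters: each predicate takes (path, file metadata) and
// returns true when the file may be treated as a removal candidate.
final class OrphanFileFilters {
  private OrphanFileFilters() {}

  // Configurable replacement for the hard-coded listing limits.
  static final class ListingLimits {
    final int maxDepth;
    final int maxDirectSubDirs;

    ListingLimits(int maxDepth, int maxDirectSubDirs) {
      this.maxDepth = maxDepth;
      this.maxDirectSubDirs = maxDirectSubDirs;
    }

    // Mirrors the values quoted in the issue: 3 levels, 10 direct sub dirs.
    static ListingLimits defaults() {
      return new ListingLimits(3, 10);
    }
  }

  // Rejects any path matching a regex that marks files of other tables.
  static BiPredicate<String, Map<String, String>> pathFilter(String foreignPathRegex) {
    Pattern foreign = Pattern.compile(foreignPathRegex);
    return (path, metadata) -> !foreign.matcher(path).find();
  }

  // Accepts a file only if its metadata carries the expected value for a key.
  static BiPredicate<String, Map<String, String>> metadataFilter(String key, String expected) {
    return (path, metadata) -> expected.equals(metadata.get(key));
  }
}
```

The two filters compose with `BiPredicate.and`, for example 
`pathFilter("^s3://bucket/other_table/").and(metadataFilter("table-uuid", "t1"))`, 
so a file is removed only when every filter agrees it belongs to this table.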
   
   For 3, I am not sure what the best approach is. I am wondering whether we 
could add a listing operation to `FileIO` that would only be used by this 
action. Would listing be useful for any other potential actions? Or would it 
invite abuse of file listing?
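
One possible shape for such a listing operation, sketched as an optional 
interface that a `FileIO` implementation could opt into. The names 
`SupportsListing`, `listPrefix`, and `InMemoryListing` are hypothetical and 
only illustrate the proposal; the in-memory class stands in for an object 
store so that the sketch runs without any storage dependency:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical opt-in capability; the issue only proposes the idea.
interface SupportsListing {
  // Returns every stored location that starts with the given prefix.
  Iterable<String> listPrefix(String prefix);
}

// Stand-in for an object store: keys are kept sorted, as a prefix listing
// in an object store would typically return them.
final class InMemoryListing implements SupportsListing {
  private final NavigableSet<String> keys = new TreeSet<>();

  void put(String key) {
    keys.add(key);
  }

  @Override
  public Iterable<String> listPrefix(String prefix) {
    List<String> matches = new ArrayList<>();
    // Keys sharing a prefix are contiguous in sorted order, so the scan
    // can start at the prefix and stop at the first non-matching key.
    for (String key : keys.tailSet(prefix, true)) {
      if (!key.startsWith(prefix)) {
        break;
      }
      matches.add(key);
    }
    return matches;
  }
}
```

Keeping listing in a separate interface, rather than on `FileIO` itself, is 
one way to limit the abuse concern: only actions that explicitly check for the 
capability would be able to list files.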
   
   @aokolnychyi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



