jackye1995 opened a new issue #3212: URL: https://github.com/apache/iceberg/issues/3212
`DeleteOrphanFiles` currently has the following limitations:

1. It assumes a single location for all files, which does not work for tables that store files in separate directories.
2. It does not work if multiple tables store files under the same root, which would be a common case for people using object storage mode to store all data files at the root location.
3. It still requires a file system to perform file listing, so `S3FileIO` users still have to install the S3A file system just to remove orphan files.

My current thoughts are the following:

For 1, this can be mitigated by running the action multiple times with different root locations, which would probably be enough for a maintenance procedure.

For 2, maybe we can add a few filters, including:

1. Path filter: allows a regex to decide whether a path might belong to some other table and should not be removed.
2. Metadata filter: allows checks against specific metadata of a file to decide whether it might belong to some other table. This also requires adding a new `Map<String, String> metadata()` method to `InputFile` for `FileIO` implementations to provide.
3. We also have the hard-coded restriction that only lists `at most 3 levels and only dirs that have less than 10 direct sub dirs on the driver`, which I think should be made configurable.

For 3, I am not sure what the best way is. I am wondering whether we could also add a listing operation to `FileIO` that would only be used for this action. Would listing be useful for any other potential actions? Or would it invite abuse of file listing?

@aokolnychyi
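To make the filter idea for limitation 2 concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical and for illustration only: `OrphanFilterSketch`, `ListedFile`, `pathMatches`, `ownedByOther`, and `orphanCandidates` are not Iceberg classes or methods, and the `table` metadata key is an assumed convention. The sketch treats a listed file as a removal candidate only if the table does not reference it and no filter says it might belong to another table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Illustrative only: none of these types exist in Iceberg.
class OrphanFilterSketch {

  // Hypothetical stand-in for a listed file plus the metadata that a
  // FileIO implementation might expose through a metadata() method.
  static final class ListedFile {
    final String path;
    final Map<String, String> metadata;

    ListedFile(String path, Map<String, String> metadata) {
      this.path = path;
      this.metadata = metadata;
    }
  }

  // Path filter: true when the whole path matches a regex that
  // describes files of other tables.
  static Predicate<ListedFile> pathMatches(String regex) {
    Pattern pattern = Pattern.compile(regex);
    return file -> pattern.matcher(file.path).matches();
  }

  // Metadata filter: true when the file carries a value for `key`
  // that differs from this table's own value.
  static Predicate<ListedFile> ownedByOther(String key, String ownValue) {
    return file -> {
      String value = file.metadata.get(key);
      return value != null && !value.equals(ownValue);
    };
  }

  // A file is a candidate only if it is unreferenced by this table
  // and no filter flags it as possibly belonging elsewhere.
  static List<String> orphanCandidates(
      List<ListedFile> listed,
      Set<String> referenced,
      Predicate<ListedFile> mightBelongElsewhere) {
    List<String> result = new ArrayList<>();
    for (ListedFile file : listed) {
      if (!referenced.contains(file.path) && !mightBelongElsewhere.test(file)) {
        result.add(file.path);
      }
    }
    return result;
  }
}
```

The two filters compose with `Predicate.or`, for example `pathMatches(".*/db/t2/.*").or(ownedByOther("table", "t1"))`, so a file is protected as soon as either check suggests another owner. Erring toward keeping files seems like the safer default for a destructive maintenance action.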
