kbendick edited a comment on issue #4346: URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1083460470
Agreed that making the source of the "actual files" list pluggable is orthogonal. My apologies for bringing it up here, as it's more related to just "making `DeleteOrphanFiles` more reliable" by avoiding the list operation on the entire object store.

Since I know that it's mostly working and rather simple, I would propose that we consider adding a way to supply a source other than the Hadoop-based listing as an additional option. Right now, it's simply another table that can be referenced, which contains the actual files of the file store.

Whether prefix normalization is applied to the table's file listing or to the list of actual files is outside the scope of a user-provided list of actual files. For example, for tables entirely on S3 in one bucket, normalizing the scheme (`s3a` vs. `s3`) is probably the most common concern that average users face in practice. Whether the normalization to one of `s3` or `s3a` is done on the table's file list or on the list of actual files, the user could still provide a list of actual files from a more definitive source that has already been adjusted to use `s3` or `s3a`.

We can open a PR for review to better show what is meant. But I don't think the normalization work needs to be completed before we make it pluggable in this way. The normalization work would naturally layer on top of this, as this simply skips one small part of the action's `execute` method. It will be clearer what is meant once the work is up, but it is a rather small change that provides a very significant benefit to a lot of users right away - avoiding the listing of the entire file store if a more definitive source of truth is available.
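
As a rough illustration of the idea (not the actual change, and not Iceberg's API), the comparison at the heart of orphan-file detection can be sketched in a few lines of Python. The names `normalize_scheme`, `find_orphans`, and `EQUIVALENT_SCHEMES` are hypothetical, as are the example paths. The sketch assumes the only difference between the two listings is the `s3a` vs. `s3` scheme, and it ignores everything else the real action does. The point is that `actual_files` can come from any source, such as a user-provided table, rather than from a full listing of the object store.

```python
# Hypothetical mapping: treat the s3a scheme as equivalent to s3.
EQUIVALENT_SCHEMES = {"s3a": "s3"}


def normalize_scheme(location):
    """Rewrite an equivalent scheme to its canonical form; leave other paths as-is."""
    scheme, sep, rest = location.partition("://")
    if not sep:
        # No scheme present (e.g. a plain local path), nothing to normalize.
        return location
    return EQUIVALENT_SCHEMES.get(scheme, scheme) + sep + rest


def find_orphans(table_files, actual_files):
    """Return actual files that the table does not reference, after normalization.

    `actual_files` may come from any source the user trusts; no listing of the
    file store happens here.
    """
    referenced = {normalize_scheme(f) for f in table_files}
    return sorted(f for f in actual_files if normalize_scheme(f) not in referenced)


# Files referenced by the table metadata (written with the s3a scheme).
table_files = [
    "s3a://bucket/db/t/data/a.parquet",
    "s3a://bucket/db/t/data/b.parquet",
]

# Files from a user-provided source of truth (recorded with the s3 scheme).
actual_files = [
    "s3://bucket/db/t/data/a.parquet",
    "s3://bucket/db/t/data/b.parquet",
    "s3://bucket/db/t/data/c.parquet",
]

orphans = find_orphans(table_files, actual_files)
print(orphans)
```

Because both sides pass through the same normalization, it does not matter which scheme each source happens to use, which is why the pluggable source and the normalization work can land independently.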
