kbendick edited a comment on issue #4346:
URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1083460470


   Agreed that making the source of the "actual files" list pluggable is 
orthogonal. My apologies for bringing it up here, as it's more related to just 
"making `DeleteOrphanFiles` more reliable" by avoiding the list operation on the 
entire object store.
   
   Since I know that it's mostly working and rather simple, I would propose 
that we consider adding an option to supply a source of actual files other 
than the Hadoop-based listing. Right now, that source is simply another table 
that can be referenced and that contains the actual files of the file store.
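
   As a rough illustration of the idea (not the actual Iceberg code), the 
sketch below treats the source of actual files as a plain callable, so a 
storage listing and a user-provided table are interchangeable. The names 
`find_orphans` and `from_inventory_table`, and the example paths, are 
hypothetical and exist only for this sketch.

```python
from typing import Callable, Iterable, Set

# Hypothetical type: any callable that yields the paths physically present in storage.
ActualFilesSource = Callable[[], Iterable[str]]


def find_orphans(referenced_files: Iterable[str],
                 actual_files_source: ActualFilesSource) -> Set[str]:
    """Return paths present in storage but not referenced by table metadata."""
    referenced = set(referenced_files)
    return {path for path in actual_files_source() if path not in referenced}


def from_inventory_table() -> Iterable[str]:
    # Stand-in for reading the rows of a user-provided table of actual files.
    return [
        "s3://bucket/db/tbl/data/a.parquet",
        "s3://bucket/db/tbl/data/b.parquet",
        "s3://bucket/db/tbl/data/stale.parquet",
    ]


referenced = [
    "s3://bucket/db/tbl/data/a.parquet",
    "s3://bucket/db/tbl/data/b.parquet",
]
orphans = find_orphans(referenced, from_inventory_table)
```

   Swapping `from_inventory_table` for a function that lists the object store 
would leave the comparison untouched, which is the sense in which the source 
is pluggable.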
   
   Whether things like prefix normalization are applied to the listing of 
files in the table or to the list of actual files is outside the scope of a 
user-provided list of actual files.
   
   For example, for tables entirely on s3 in one bucket, normalizing the 
scheme of the table's files to `s3a` or `s3` is probably the most common 
concern that average users face in practice.
   
   Whether the normalization to one of `s3` or `s3a` is done on the table's 
file list or the list of actual files, the user could still provide a list of 
actual files from a more definitive source that has been properly adjusted to 
be `s3` or `s3a`.
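
   A minimal sketch of what such scheme normalization could look like, 
assuming paths are plain strings and that `s3` is chosen as the canonical 
scheme. The helper `normalize_scheme` and the example paths are hypothetical 
and are not part of Iceberg.

```python
# Schemes treated as equivalent for comparison purposes (assumption for this sketch).
S3_SCHEMES = {"s3", "s3a"}


def normalize_scheme(path: str, canonical: str = "s3") -> str:
    """Rewrite an s3/s3a path to the canonical scheme; leave other paths alone."""
    scheme, sep, rest = path.partition("://")
    if sep and scheme in S3_SCHEMES:
        return f"{canonical}://{rest}"
    return path


# Table metadata references files with s3a, the provided list uses s3.
referenced = ["s3a://bucket/tbl/data/a.parquet"]
actual = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/stale.parquet",
]

normalized_referenced = {normalize_scheme(p) for p in referenced}
orphans = [p for p in actual if normalize_scheme(p) not in normalized_referenced]
```

   Without the normalization step, `a.parquet` would be reported as an orphan 
purely because of the scheme mismatch.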
   
   We can open a PR for review to better show what is meant. But I don't think 
that the normalization work needs to be completed before we make it pluggable 
in this way. The normalization work would naturally layer on top of this, as 
this simply skips one small part of the action's `execute` method.
   
   What is meant will be clearer once the work is put up, but it is a 
rather small change that provides a very significant benefit to a lot of users 
right away: avoiding the listing of the entire file store if a more definitive 
source of truth is available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


