szehon-ho edited a comment on issue #4346:
URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1075746153


   > A few recent issues reported by the community: 
https://github.com/apache/iceberg/issues/4194, 
https://github.com/apache/iceberg/issues/4161. I know there were more issues 
and a few PRs too. Feel free to link anything that may be related.
   
   Found another PR on this:  https://github.com/apache/iceberg/pull/2890
   
   Need to think a bit about CRC.
   
   On first thought, the original ignoreScheme/ignoreAuthority proposal seems 
to solve the problem.  But if the goal is to make RemoveOrphan safe (e.g., not 
removing all files when we change the HDFS authority, as happens today), then 
both flags need to default to true, and as Ryan says, we need to decide the 
right values for every FileSystem we support, since true may not be right for 
S3.  Personally, I also think the flag names may cause some confusion (they 
could be read as meaning that remove orphans ignores prefix checking when 
removing files).
   
   Maybe an alternative is to do it the other way around: the user has to 
force the delete when scheme/authority do not match to get the old behavior 
(which should be rare), or choose to skip those files.
   
   We could do it by having the inner Spark job distinguish and return two 
sets:
   1. FileSystem files that do not match either the absolute or the relative 
path of any reachable Iceberg file.
   2. FileSystem files that match the relative path, but not the absolute 
path, of a reachable Iceberg file (before, these would also be returned as 
orphans and silently deleted).
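
   A minimal sketch of that two-way split, written in plain Python rather 
than as the actual Spark job. The function name `classify_candidates` is 
hypothetical, and "relative path" is assumed here to mean the URI path with 
scheme and authority stripped:

```python
from urllib.parse import urlparse


def classify_candidates(actual_files, reachable_files):
    """Split listed files into (no_match, prefix_mismatch).

    Assumption: the "relative path" is the URI path component, i.e. the
    location without scheme and authority.
    """
    reachable_abs = set(reachable_files)
    reachable_rel = {urlparse(f).path for f in reachable_files}

    no_match, prefix_mismatch = [], []
    for f in actual_files:
        if f in reachable_abs:
            continue  # still referenced by the table, not an orphan
        if urlparse(f).path in reachable_rel:
            # Set 2: same path, different scheme/authority.
            prefix_mismatch.append(f)
        else:
            # Set 1: matches neither the absolute nor the relative path.
            no_match.append(f)
    return no_match, prefix_mismatch
```

   With this split, a change of HDFS authority would put every live file in 
the second list instead of the first.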
   
   Then a flag "prefixMismatchMode" with values "error", "delete", or "skip" 
controls what to do with the second set (default=error, which throws an 
exception).
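
   A rough sketch of how the three modes could be applied on the driver. 
The helper name `files_to_delete` is hypothetical, and it assumes that 
"error" only throws when the second set is non-empty:

```python
def files_to_delete(no_match, prefix_mismatch, prefix_mismatch_mode="error"):
    """Return the files to remove, given the two sets and the mode."""
    if prefix_mismatch_mode == "delete":
        # Old behavior: prefix-mismatched files are treated as orphans too.
        return list(no_match) + list(prefix_mismatch)
    if prefix_mismatch_mode == "skip":
        # Leave prefix-mismatched files alone.
        return list(no_match)
    if prefix_mismatch_mode == "error":
        # Assumption: fail only if there is at least one mismatch.
        if prefix_mismatch:
            raise ValueError(
                f"{len(prefix_mismatch)} file(s) match a reachable file by "
                "relative path but not by scheme/authority"
            )
        return list(no_match)
    raise ValueError(f"Unknown prefixMismatchMode: {prefix_mismatch_mode}")
```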
   
   This might be an easier UX to me, with no need to choose 
file-system-specific flags.  The user in this case specifically chooses to 
skip if the prefix no longer matches.  But performance is worse than with the 
original proposal in this case, as we would return a lot of results to the 
driver just to throw the exception; not sure if it's worth it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


