szehon-ho commented on issue #4346: URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1075746153
> A few recent issues reported by the community: https://github.com/apache/iceberg/issues/4194, https://github.com/apache/iceberg/issues/4161. I know there were more issues and a few PRs too. Feel free to link anything that may be related.

Found another PR on this: https://github.com/apache/iceberg/pull/2890

Need to think a bit about CRC. On first thought, the original ignoreScheme/ignoreAuthority proposal seems to solve the problem. But if the goal is to make RemoveOrphan safe (e.g., not removing all files when the HDFS authority changes, as happens today), then both flags need to default to true, and as Ryan says, we need to decide the right values for every FileSystem we support, since that default may not be right for S3.

A personal thought: the flag name may cause some confusion (it could be read as "remove orphans ignores prefix checking when removing files").

Maybe an alternative is to do it the other way around: the user has to force the delete when scheme/authority do not match in order to get the old behavior (which should be rare).

We could do it by having the inner Spark job distinguish and return two sets:

1. FileSystem files that match neither the absolute nor the relative path of any reachable Iceberg file.
2. FileSystem files that match the relative path, but not the absolute path, of a reachable Iceberg file. (Today these are also returned as orphans and silently deleted.)

If the latter set is non-empty, we could throw an exception unless the user explicitly enables the 'ignoreScheme/Authority' flag, which defaults to false. (To me this would actually match the meaning of the flag a little better, as the remove action ignores authority-match checking when removing.)

This could be safer and a better UX, but with worse performance than the original proposal, since we would return a lot of results to the driver just to throw the exception. Not sure what people think.
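To make the two-set idea concrete, here is a minimal sketch in plain Python rather than Spark. It is not Iceberg code: `classify`, `files_to_delete`, `AuthorityMismatchError`, and the `ignore_scheme_authority` parameter are hypothetical names for illustration, and "relative path" is assumed to mean the URI path with scheme and authority stripped.

```python
from urllib.parse import urlparse


class AuthorityMismatchError(Exception):
    """Hypothetical error raised when only scheme/authority differ."""


def relative_path(uri):
    # Assumption: the "relative path" is the URI without scheme and authority.
    return urlparse(uri).path


def classify(fs_files, reachable_files):
    """Split listed files into (true orphans, scheme/authority mismatches)."""
    reachable_abs = set(reachable_files)
    reachable_rel = {relative_path(f) for f in reachable_files}
    true_orphans, mismatched = [], []
    for f in fs_files:
        if f in reachable_abs:
            continue  # exact match: file is still referenced
        if relative_path(f) in reachable_rel:
            mismatched.append(f)  # set 2: relative path matches, absolute does not
        else:
            true_orphans.append(f)  # set 1: no match at all
    return true_orphans, mismatched


def files_to_delete(fs_files, reachable_files, ignore_scheme_authority=False):
    true_orphans, mismatched = classify(fs_files, reachable_files)
    if mismatched and not ignore_scheme_authority:
        # Default: refuse to delete rather than silently removing set 2.
        raise AuthorityMismatchError(
            f"{len(mismatched)} file(s) differ only in scheme/authority"
        )
    # Flag enabled: old behavior, both sets are treated as orphans.
    return true_orphans + mismatched
```

With a table whose metadata references `hdfs://nn1:8020/...` while the listing returns `hdfs://nn2:8020/...`, the default call raises instead of deleting every data file; passing `ignore_scheme_authority=True` restores today's behavior. In a real Spark job the mismatched set would have to be collected to the driver, which is the performance cost mentioned above.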
