aokolnychyi commented on issue #4346: URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1069629988
> I think it's reasonable to have some way to normalize both sets of paths before comparing them. It is a little concerning to me that the proposal for doing that is to use Hadoop's path normalization and then optionally ignore certain parts. Delegating that to Hadoop doesn't seem like a good idea to me, but it is at least a start.

This is mainly because we use Hadoop `FileSystem` for listing. If we use proprietary logic to normalize paths, we may end up handling edge cases differently from Hadoop. Normalization mostly solves cosmetic issues in the path part of URIs; it does not solve the scheme and authority mismatch.

I am happy to consider alternatives. In the future, we can customize `DeleteOrphanFiles` so that listing via `FileSystem` becomes just one way to build a list of actual files. For now, using Hadoop for normalization when we are doing listing via Hadoop seems reasonable to me.

> I think we should also have strategies for this that depend on the file system scheme. s3 and s3a are essentially the same thing, but you can't ignore authority (bucket) for S3. HDFS may have different namenodes in the authority and whether they're equivalent depends on the Hadoop Configuration. I'd like to get more specific about the file systems that are supported and how each one will normalize and compare.

This is where it gets tricky. The Hadoop conf in jobs that write to the table can differ from the Hadoop conf in jobs that delete orphan files. Users can define arbitrary schemes or use different yet equivalent authorities.

Maybe the default values for the ignore options can depend on the scheme of the location we clean. For example, if the location we scan for orphan files starts with `s3`, we can ignore the scheme but have to compare the authority and normalized path.

We can discuss default values more, but what about having `ignore-scheme` and `ignore-authority` in general? Do we consider that useful?
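To make the `ignore-scheme` / `ignore-authority` idea concrete, here is a rough sketch of the comparison I have in mind. It is not the actual `DeleteOrphanFiles` code: the class and method names are hypothetical, and it uses `java.net.URI#normalize` as a stand-in for Hadoop's path normalization so that the snippet has no Hadoop dependency. The two normalizations may differ on edge cases.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical helper, for illustration only; not part of Iceberg.
final class PathEquivalence {
  private PathEquivalence() {}

  // Compares two file locations after normalizing the path part.
  // Scheme and authority are compared only when the matching ignore flag is false.
  static boolean samePath(
      String actual, String expected, boolean ignoreScheme, boolean ignoreAuthority) {
    // java.net.URI normalization removes "." and ".." segments (stand-in for Hadoop).
    URI a = URI.create(actual).normalize();
    URI b = URI.create(expected).normalize();

    if (!Objects.equals(a.getPath(), b.getPath())) {
      return false;
    }
    if (!ignoreScheme && !Objects.equals(a.getScheme(), b.getScheme())) {
      return false;
    }
    return ignoreAuthority || Objects.equals(a.getAuthority(), b.getAuthority());
  }

  // Assumed default: ignore the scheme only when the scanned location starts with "s3"
  // (covers s3, s3a, s3n). The authority (bucket) is still compared in that case.
  static boolean defaultIgnoreScheme(String location) {
    String scheme = URI.create(location).getScheme();
    return scheme != null && scheme.startsWith("s3");
  }
}
```

With these assumptions, `s3://bucket/data/f.parquet` and `s3a://bucket/data/f.parquet` match when the scheme is ignored, while two different buckets never match unless the authority is ignored as well. For HDFS, ignoring the authority would treat files under different namenode addresses as the same file, which is only safe if the user knows those authorities are equivalent.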
