rdblue commented on issue #4346: URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1069567047
I think about this problem slightly differently. Rather than thinking of it as a lack of normalization, the underlying file store may have multiple ways to refer to the same file. That framing is broader and covers cases like schemes that mean the same thing (s3 and s3a) as well as authority issues.

I think it's reasonable to have some way to normalize both sets of paths before comparing them. It is a little concerning to me that the proposal for doing that is to use Hadoop's path normalization and then optionally ignore certain parts. Delegating that to Hadoop doesn't seem like a good idea to me, but it is at least a start.

I think we should also have strategies for this that depend on the file system scheme. s3 and s3a are essentially the same thing, but you can't ignore the authority (bucket) for S3. HDFS may have different namenodes in the authority, and whether they're equivalent depends on the Hadoop Configuration. I'd like to get more specific about the file systems that are supported and how each one will normalize and compare.
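To make the idea concrete, here is a minimal sketch (not Iceberg's actual API; the class and the scheme table are hypothetical) of a scheme-aware normalizer: it maps equivalent schemes like s3a onto s3, keeps the authority as part of the comparison, and normalizes the path segments before comparing.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical sketch: normalize two location strings before comparing them,
// with a per-scheme equivalence table. Real per-file-system strategies would
// need more than this (e.g. consulting the Hadoop Configuration for HDFS
// namenode equivalence, which this sketch does not do).
public class LocationNormalizer {
  // s3a and s3n are Hadoop connectors for the same store as s3,
  // so treat all three schemes as equivalent.
  private static final Map<String, String> EQUIVALENT_SCHEMES =
      Map.of("s3a", "s3", "s3n", "s3");

  public static String normalize(String location) {
    URI uri = URI.create(location).normalize(); // resolves "." and ".." segments
    String scheme = uri.getScheme() == null
        ? ""
        : EQUIVALENT_SCHEMES.getOrDefault(uri.getScheme(), uri.getScheme());
    // The authority (the bucket, for S3) is compared exactly; it cannot be
    // ignored because the same key in two buckets is two different files.
    String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
    return scheme + "://" + authority + uri.getPath();
  }

  public static boolean sameFile(String a, String b) {
    return normalize(a).equals(normalize(b));
  }
}
```

With this, `sameFile("s3a://bucket/data/f.parquet", "s3://bucket/data/f.parquet")` is true, while the same path in two different buckets compares unequal.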
