rdblue commented on issue #4346: URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1069567047
I think about this problem slightly differently. Rather than thinking of it as a lack of normalization, the underlying file store may have multiple ways to refer to the same file. That framing is broader and covers cases like schemes that mean the same thing (s3 and s3a) as well as authority issues.

I think it's reasonable to have some way to normalize both sets of paths before comparing them. It is a little concerning to me that the proposal for doing that is to use Hadoop's path normalization and then optionally ignore certain parts. Delegating that to Hadoop doesn't seem like a good idea to me, but it is at least a start.

I think we should also have strategies for this that depend on the file system scheme. s3 and s3a are essentially the same thing, but you can't ignore the authority (bucket) for S3. HDFS may have different namenodes in the authority, and whether they're equivalent depends on the Hadoop Configuration. I'd like to get more specific about the file systems that are supported and how each one will normalize and compare.
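To make the idea concrete, here is a minimal sketch (not Iceberg's actual API; the class and the scheme table are hypothetical) of a scheme-aware normalizer: it maps equivalent schemes like s3a onto s3, keeps the authority as part of the comparison, and normalizes the path segments before comparing.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical sketch: normalize two location strings before comparing them,
// with a per-scheme equivalence table. Real per-file-system strategies would
// need more than this (e.g. consulting the Hadoop Configuration for HDFS
// namenode equivalence, which this sketch does not do).
public class LocationNormalizer {
  // s3a and s3n are Hadoop connectors for the same store as s3,
  // so treat all three schemes as equivalent.
  private static final Map<String, String> EQUIVALENT_SCHEMES =
      Map.of("s3a", "s3", "s3n", "s3");

  public static String normalize(String location) {
    URI uri = URI.create(location).normalize(); // resolves "." and ".." segments
    String scheme = uri.getScheme() == null
        ? ""
        : EQUIVALENT_SCHEMES.getOrDefault(uri.getScheme(), uri.getScheme());
    // The authority (the bucket, for S3) is compared exactly; it cannot be
    // ignored because the same key in two buckets is two different files.
    String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
    return scheme + "://" + authority + uri.getPath();
  }

  public static boolean sameFile(String a, String b) {
    return normalize(a).equals(normalize(b));
  }
}
```

With this, `sameFile("s3a://bucket/data/f.parquet", "s3://bucket/data/f.parquet")` is true, while the same path in two different buckets compares unequal.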
