[GitHub] [iceberg] aokolnychyi commented on issue #4346: Make DeleteOrphanFiles in Spark reliable

GitBox Wed, 16 Mar 2022 14:03:31 -0700


aokolnychyi commented on issue #4346:
URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1069629988



   > I think it's reasonable to have some way to normalize both sets of paths 
before comparing them. It is a little concerning to me that the proposal for 
doing that is to use Hadoop's path normalization and then optionally ignore 
certain parts. Delegating that to Hadoop doesn't seem like a good idea to me, 
but it is at least a start.
   
   This is mainly because we use Hadoop `FileSystem` for listing. If we use 
our own logic to normalize paths, we may handle edge cases differently than 
the listing does. Normalization mostly fixes cosmetic issues in the path part 
of URIs; it does not resolve scheme or authority mismatches.
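
   To illustrate the limitation, here is a minimal Python sketch. It is not 
Hadoop's normalization: `posixpath.normpath` is only a stand-in for it, and the 
helper name `normalize_path_only` and the sample locations are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_path_only(location: str) -> str:
    """Clean up the path part of a URI; leave scheme and authority untouched.

    posixpath.normpath is a stand-in for Hadoop's path normalization here,
    not a faithful reimplementation of it.
    """
    parts = urlsplit(location)
    path = posixpath.normpath(parts.path) if parts.path else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Hypothetical locations: the same object as recorded in metadata and as
# returned by a listing that used a different scheme.
from_metadata = normalize_path_only("s3://bucket/warehouse//db/./f.parquet")
from_listing = normalize_path_only("s3a://bucket/warehouse/db/f.parquet")

# The path parts now agree, but the full URIs still differ by scheme.
same_path = urlsplit(from_metadata).path == urlsplit(from_listing).path
same_uri = from_metadata == from_listing
```

   After normalization `same_path` is true while `same_uri` is still false, so 
a plain string comparison would still treat the listed file as an orphan.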
   
   I am happy to consider alternatives. In the future, we can customize 
`DeleteOrphanFiles` so that listing via `FileSystem` just becomes one way to 
build a list of actual files. For now, using Hadoop for normalization when we 
are doing listing via Hadoop seems reasonable to me.
   
   > I think we should also have strategies for this that depend on the file 
system scheme. s3 and s3a are essentially the same thing, but you can't ignore 
authority (bucket) for S3. HDFS may have different namenodes in the authority 
and whether they're equivalent depends on the Hadoop Configuration. I'd like to 
get more specific about the file systems that are supported and how each one 
will normalize and compare.
   
   This is where it gets tricky. The Hadoop conf in jobs that write to the 
table can differ from the Hadoop conf in jobs that delete orphan files. Users 
can define arbitrary schemes or use different yet equivalent authorities.
   
   Maybe the default values for the ignore options can depend on the scheme of 
the location we clean. For example, if the location we scan for orphan files 
starts with `s3`, we can ignore the scheme but have to compare the authority 
and normalized path. We can discuss default values more, but what about having 
`ignore-scheme` and `ignore-authority` in general? Do we consider that useful?
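
   As a rough sketch of how such options could behave, the Python below 
compares locations by a key built from scheme, authority, and normalized path. 
Everything in it is an assumption for discussion: the function names, the 
`s3`-prefix default, and the use of `posixpath.normpath` in place of Hadoop's 
normalization. It is not existing Iceberg API.

```python
import posixpath
from urllib.parse import urlsplit


def default_ignore_options(location: str) -> dict:
    """Hypothetical defaults chosen from the scheme of the cleaned location."""
    scheme = urlsplit(location).scheme.lower()
    if scheme.startswith("s3"):
        # s3 and s3a name the same store, but the bucket (authority) matters.
        return {"ignore_scheme": True, "ignore_authority": False}
    # Conservative fallback: compare every URI component.
    return {"ignore_scheme": False, "ignore_authority": False}


def comparison_key(location: str, ignore_scheme: bool, ignore_authority: bool) -> tuple:
    """Build the tuple that two equivalent locations should share."""
    parts = urlsplit(location)
    scheme = None if ignore_scheme else parts.scheme.lower()
    authority = None if ignore_authority else parts.netloc
    # normpath stands in for Hadoop's path normalization in this sketch.
    return (scheme, authority, posixpath.normpath(parts.path))


def find_orphans(actual_files, referenced_files, table_location, **overrides):
    """Return listed files whose key matches no referenced file."""
    opts = {**default_ignore_options(table_location), **overrides}
    referenced = {comparison_key(f, **opts) for f in referenced_files}
    return [f for f in actual_files if comparison_key(f, **opts) not in referenced]


# Hypothetical data: metadata was written with s3://, listing used s3a://.
referenced = ["s3://bucket/t/data/a.parquet"]
listed = [
    "s3a://bucket/t/data/a.parquet",   # same file, different scheme
    "s3a://bucket/t/data//b.parquet",  # not referenced anywhere
    "s3a://other/t/data/a.parquet",    # same path, different bucket
]
orphans = find_orphans(listed, referenced, "s3://bucket/t")

# HDFS: different namenode authorities match only if explicitly ignored.
hdfs_strict = find_orphans(["hdfs://nn2/t/a"], ["hdfs://nn1/t/a"], "hdfs://nn1/t")
hdfs_relaxed = find_orphans(
    ["hdfs://nn2/t/a"], ["hdfs://nn1/t/a"], "hdfs://nn1/t", ignore_authority=True
)
```

   With these assumed defaults, the `s3a://bucket/...a.parquet` entry is 
matched to its `s3://` counterpart, while the file in the other bucket is still 
reported. In the HDFS case, the two namenodes are treated as equivalent only 
when the user opts in with `ignore_authority=True`.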


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


