aokolnychyi commented on issue #4346:
URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1076756955


   @anuragmantri, I am not sure relative paths will help. They will be optional 
and we will still need to resolve them somehow to compare with absolute paths 
we get from listing. I assume we will have the same issues without 
normalization and with different yet equivalent authorities/schemes.
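
To illustrate the concern, here is a minimal sketch in plain Python. The bucket, table location and file names are made up, and `resolve` is a hypothetical helper rather than anything in Iceberg. It shows that a resolved relative path can still differ as a string from what a listing returns, even when both point at the same file:

```python
import posixpath
from urllib.parse import urlsplit


def resolve(table_location: str, relative_path: str) -> str:
    """Hypothetical helper: join a relative path onto a table location."""
    return table_location.rstrip("/") + "/" + relative_path.lstrip("/")


# Path stored in metadata as relative, resolved against the table location.
referenced = resolve("s3://bucket/tbl", "data/a.parquet")

# The same file as a listing might report it: another scheme, a doubled slash.
listed = "s3a://bucket/tbl/data//a.parquet"

# A plain string comparison treats them as different files.
same_string = referenced == listed

# Comparing only the normalized path component treats them as the same file.
same_path = (
    posixpath.normpath(urlsplit(referenced).path)
    == posixpath.normpath(urlsplit(listed).path)
)
```

Here `same_string` is `False` while `same_path` is `True`, so resolving relative paths alone does not remove the need for normalization.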
   
   @flyrain, it is an interesting idea. However, is there an efficient way to 
compute these values for all files in a location? I assume it will require a 
request for every listed file, making this extremely expensive. We may find a 
way to persist these values for referenced files for future use cases, but I am 
afraid we will still need to send a request for every actual file that we get 
after listing a location. That is going to be costly (not to mention that 
listing itself is already extremely expensive).
   
   @szehon-ho, I can see your idea being implemented. If I understand correctly, 
it will behave like this:
   
   - Build a DF of reachable normalized paths with scheme/authority (if a file 
does not have either scheme or authority, inherit it from the location we 
clean).
   - Build a DF of reachable normalized paths without scheme/authority.
   - Build a DF of all actual files in the location with scheme/authority.
   - Find actual locations that don't match reachable normalized paths with 
scheme/authority. These are potentially orphan files.
   - Among potentially orphan files, find which of them match reachable files 
if we ignore their scheme and authority.
   - Act according to `prefix-mismatch-mode`, which can be `error` (default), 
`delete`, or `ignore`.
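
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses plain Python sets instead of Spark DataFrames, `split_uri` and `find_orphans` are hypothetical names, and the exact semantics of `delete` (treat mismatched files as orphans) and `ignore` (leave them alone) are my guess at what the modes would mean. It also ignores characters such as `?` and `#` in file names, which `urlsplit` would treat specially.

```python
import posixpath
from urllib.parse import urlsplit


def split_uri(uri, default_scheme="", default_authority=""):
    """Return (scheme, authority, normalized path), inheriting missing parts."""
    parts = urlsplit(uri)
    return (
        parts.scheme or default_scheme,
        parts.netloc or default_authority,
        posixpath.normpath(parts.path),
    )


def find_orphans(reachable, actual, location, prefix_mismatch_mode="error"):
    """Return actual files considered orphans under the given mismatch mode."""
    if prefix_mismatch_mode not in ("error", "delete", "ignore"):
        raise ValueError(f"unknown mode: {prefix_mismatch_mode}")

    loc_scheme, loc_authority, _ = split_uri(location)

    # Reachable paths with scheme/authority (inherited from the location).
    reachable_full = {split_uri(p, loc_scheme, loc_authority) for p in reachable}
    # Reachable paths without scheme/authority.
    reachable_paths = {path for _, _, path in reachable_full}

    orphans, mismatched = [], []
    for original in actual:
        key = split_uri(original, loc_scheme, loc_authority)
        if key in reachable_full:
            continue  # referenced by the table, definitely not an orphan
        if key[2] in reachable_paths:
            mismatched.append(original)  # same path, other scheme/authority
        else:
            orphans.append(original)

    if mismatched:
        if prefix_mismatch_mode == "error":
            raise ValueError(f"scheme/authority mismatch for: {mismatched}")
        if prefix_mismatch_mode == "delete":
            orphans.extend(mismatched)
        # "ignore": leave the mismatched files out of the result

    return sorted(orphans)


reachable = ["s3://bucket/tbl/data/a.parquet", "/tbl/data/b.parquet"]
actual = [
    "s3://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/b.parquet",
    "s3a://bucket/tbl/data/a.parquet",
    "s3://bucket/tbl/data/c.parquet",
]
ignored = find_orphans(reachable, actual, "s3://bucket/tbl", "ignore")
deleted = find_orphans(reachable, actual, "s3://bucket/tbl", "delete")
```

With this made-up input, `ignore` reports only `c.parquet`, `delete` additionally reports the `s3a://` copy of `a.parquet`, and the default `error` mode raises because of that same file.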
   
   I think it is worth exploring this option. It is also not perfect, though. 
Cases when the bucket name (authority) is different would result in an 
exception. Also, are there scenarios where we want to compare authorities but 
not schemes?
   
   What does everybody else think?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


