aokolnychyi commented on issue #4346: URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1076756955
@anuragmantri, I am not sure relative paths will help. They will be optional and we will still need to resolve them somehow to compare with the absolute paths we get from listing. I assume we will have the same issues without normalization and with different yet equivalent authorities/schemes.

@flyrain, it is an interesting idea. However, is there an efficient way to compute these values for all files in a location? I assume it will require a request for every listed file, making this extremely expensive. We may find a way to persist these values for referenced files for future use cases, but I am afraid we will need to send a request for every actual file that we get after listing a location, and that is going to be costly (not to mention that listing itself is already extremely expensive).

@szehon-ho, I can see your idea being implemented. If I understand correctly, it will behave like this:

- Build a DF of reachable normalized paths with scheme/authority (if a file does not have either scheme or authority, inherit it from the location we clean).
- Build a DF of reachable normalized paths without scheme/authority.
- Build a DF of all actual files in the location with scheme/authority.
- Find actual locations that don't match reachable normalized paths with scheme/authority. These are potentially orphan files.
- Among potentially orphan files, find which of them match reachable files if we ignore their scheme and authority.
- Act according to `prefix-mismatch-mode`, which can be `error` (default), `delete`, or `ignore`.

I think it is worth exploring this option. It is also not perfect, though. Cases where the bucket name (authority) is different would result in an exception. Also, are there scenarios where we want to compare authorities but not schemes?

What does everybody else think?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
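
For illustration only, the steps listed in the comment above could be sketched in plain Python, with sets standing in for the DataFrames. This is not Iceberg code: the names `normalize` and `find_orphans` are hypothetical, and the meaning given to each `prefix-mismatch-mode` value (`error` raises, `delete` treats mismatched files as orphans, `ignore` keeps them) is an assumption based on the mode names.

```python
import posixpath
from urllib.parse import urlparse


def normalize(path, default_scheme, default_authority):
    """Return (scheme, authority, normalized path).

    A missing scheme or authority is inherited from the location being cleaned.
    """
    parsed = urlparse(path)
    scheme = parsed.scheme or default_scheme
    authority = parsed.netloc or default_authority
    return scheme, authority, posixpath.normpath(parsed.path)


def find_orphans(reachable, actual, location, prefix_mismatch_mode="error"):
    """Hypothetical sketch of the comparison described in the comment."""
    if prefix_mismatch_mode not in ("error", "delete", "ignore"):
        raise ValueError(f"unknown mode: {prefix_mismatch_mode}")

    loc = urlparse(location)

    # Reachable paths with scheme/authority, and the same paths without them.
    full_reachable = {normalize(p, loc.scheme, loc.netloc) for p in reachable}
    bare_reachable = {key[2] for key in full_reachable}

    orphans = []      # no match even when scheme/authority are ignored
    mismatched = []   # match only when scheme/authority are ignored
    for path in actual:
        key = normalize(path, loc.scheme, loc.netloc)
        if key in full_reachable:
            continue  # referenced by the table, not an orphan
        if key[2] in bare_reachable:
            mismatched.append(path)
        else:
            orphans.append(path)

    if mismatched:
        if prefix_mismatch_mode == "error":
            raise ValueError(f"scheme/authority mismatch for: {mismatched}")
        if prefix_mismatch_mode == "delete":
            # Assumed semantics: treat mismatched files as orphans.
            orphans.extend(mismatched)
        # "ignore": leave mismatched files alone.
    return orphans
```

In this sketch a file reachable as `s3a://bucket/t/data/b.parquet` but listed as `s3://bucket/t/data/b.parquet` lands in the mismatched group, so what happens to it depends entirely on the chosen mode. A different bucket name is treated the same way as a different scheme, which is the limitation the comment points out.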
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
