aokolnychyi commented on issue #4346:
URL: https://github.com/apache/iceberg/issues/4346#issuecomment-1098375342

   @kbendick, I agree it is useful to supply locations instead of relying on 
listing. I believe there is an open PR that can be merged prior to any work 
discussed here.
   
   @karuppayya and I spent some time discussing this, and I personally think 
@szehon-ho's idea of having an error mode is quite promising. I'd probably 
have only `error` and `ignore` modes and combine them with other ideas mentioned 
on this thread. 
   
   - Normalize the path part of URIs to avoid cosmetic differences like extra 
slashes.
   - Introduce a `prefix-mismatch-mode` option. Possible values are `error` 
(default) and `ignore`.
   - Expose ways to influence the comparison. For instance, allow passing 
equivalent schemes.
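
To make the normalization idea concrete, here is a minimal sketch in plain Python. It is not Iceberg code: the function name `split_location` and the exact normalization rules (collapse repeated slashes, drop a trailing slash) are assumptions chosen for illustration.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_location(location: str) -> Tuple[Optional[str], Optional[str], str]:
    """Split a file location into (scheme, authority, normalized path).

    Hypothetical helper. Locations containing '?' or '#' are not handled,
    because urlsplit treats them as query and fragment separators.
    """
    parts = urlsplit(location)
    scheme = parts.scheme or None      # e.g. "s3", or None when absent
    authority = parts.netloc or None   # e.g. the bucket name, or None
    # Collapse runs of slashes so "//data///f" and "/data/f" compare equal.
    path = re.sub(r"/{2,}", "/", parts.path)
    # Drop a trailing slash, but keep a bare "/" intact.
    if len(path) > 1:
        path = path.rstrip("/")
    return scheme, authority, path


print(split_location("s3://bucket1//data///f.parquet"))
```

With this split, two locations that differ only in extra slashes produce the same `path`, so only `scheme` and `authority` are left to reconcile.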
   
   I like this approach because it throws an exception if something 
suspicious happens and gives the user ways to resolve conflicts instead of 
silently taking some action.
   
   The actual algorithm can be like this:
   
   - Build actual file DF
       - Either provided by the user or acquired via listing. If listing, the 
location must contain a scheme and authority.
   - Build reachable file DF via metadata tables
   - Transform both actual and reachable DFs so that they contain `scheme`, 
`authority`, `path` columns.
   - Perform LEFT OUTER JOIN on `path` and map partitions.
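
The join step above can be sketched without Spark. The following is only an illustration under stated assumptions: plain Python lists stand in for the two DataFrames, and `FileRef` and `left_outer_join_on_path` are hypothetical names, not part of any Iceberg or Spark API.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional, Tuple


class FileRef(NamedTuple):
    """One row with the `scheme`, `authority`, `path` columns (hypothetical type)."""
    scheme: Optional[str]
    authority: Optional[str]
    path: str


def left_outer_join_on_path(
    actual: List[FileRef], reachable: List[FileRef]
) -> List[Tuple[FileRef, Optional[FileRef]]]:
    """Pair each actual file with every reachable file that has the same path.

    Actual files with no match are kept and paired with None, which is
    what a LEFT OUTER JOIN on `path` would produce.
    """
    by_path = defaultdict(list)
    for ref in reachable:
        by_path[ref.path].append(ref)

    joined = []
    for a in actual:
        matches = by_path.get(a.path)
        if matches:
            joined.extend((a, v) for v in matches)
        else:
            joined.append((a, None))  # no reachable file with this path
    return joined


actual = [FileRef("s3", "bucket1", "/p1"), FileRef("s3", "bucket1", "/p9")]
reachable = [FileRef(None, None, "/p1")]
for row in left_outer_join_on_path(actual, reachable):
    print(row)
```

Rows paired with `None` have no reachable counterpart at all. Rows with a counterpart still need their scheme and authority compared, which is where the mismatch mode comes in.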
   
   ```
   | actual_scheme | actual_authority | path | valid_scheme | valid_authority | path |
   |---------------|------------------|------|--------------|-----------------|------|
   | s3            | bucket1          | p1   | null         | null            | p1   |
   | s3            | bucket1          | p2   | s3a          | bucket1         | p2   |
   | s3            | bucket1          | p3   | s3a          | bucket2         | p3   |

   p1 -> not orphan (null scheme/authority in metadata match any scheme/authority)
   p2 -> not orphan (must have defaults for equivalent schemes like s3 and s3a)
   p3 -> error by default and can be either ignored or the user may indicate that
         bucket1 and bucket2 are different, which will make s3, bucket1, p3 orphan.
   ```
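
The three example rows can be replayed with a small Python sketch of the per-row decision. Everything here is an assumption made for illustration: the names `is_orphan`, `PrefixMismatchError`, `DEFAULT_EQUAL_SCHEMES`, and `distinct_authorities` are hypothetical, and `ignore` is interpreted as "leave the file alone", which the proposal does not spell out.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Assumed defaults for equivalent schemes; the proposal only names s3 and s3a.
DEFAULT_EQUAL_SCHEMES: Dict[str, str] = {"s3a": "s3"}

Prefix = Tuple[Optional[str], Optional[str]]  # (scheme, authority)


class PrefixMismatchError(ValueError):
    """Raised in `error` mode when paths match but prefixes conflict."""


def is_orphan(
    actual: Prefix,
    valid: Prefix,
    mode: str = "error",
    equal_schemes: Dict[str, str] = DEFAULT_EQUAL_SCHEMES,
    distinct_authorities: FrozenSet[FrozenSet[str]] = frozenset(),
) -> bool:
    """Decide whether a path-matched actual file is an orphan."""
    a_scheme, a_auth = actual
    v_scheme, v_auth = valid
    canon = lambda s: equal_schemes.get(s, s)
    # A null scheme or authority in metadata matches anything.
    scheme_ok = v_scheme is None or canon(a_scheme) == canon(v_scheme)
    auth_ok = v_auth is None or a_auth == v_auth
    if scheme_ok and auth_ok:
        return False
    # The user declared these authorities to be different locations.
    if not auth_ok and frozenset((a_auth, v_auth)) in distinct_authorities:
        return True
    if mode == "ignore":
        return False  # assumed meaning: skip the file, take no action
    raise PrefixMismatchError(f"conflicting prefixes: {actual} vs {valid}")


print(is_orphan(("s3", "bucket1"), (None, None)))        # row p1
print(is_orphan(("s3", "bucket1"), ("s3a", "bucket1")))  # row p2
```

Row p3 raises `PrefixMismatchError` under the default mode. Passing `mode="ignore"` or `distinct_authorities=frozenset({frozenset({"bucket1", "bucket2"})})` resolves it in the two ways described in the table.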
   
   Any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]