ConeyLiu opened a new pull request #2890:
URL: https://github.com/apache/iceberg/pull/2890


   `RemoveOrphanFiles` uses `actualFileDF leftanti join validFileDF` to 
determine which files should be removed. We list all the files under the 
provided location (or the table location) with `FileSystem.listStatus` to 
build `actualFileDF`. `validFileDF` is built by indexing the metadata files 
and the files they reference.
   
   However, `FileSystem.listStatus` qualifies the given path. For example, 
the path `hdfs:/path` is qualified to `hdfs://host:port/path`. If the 
`warehouse` is set to `hdfs:/path`:
   
   `validFileDF`:
       hdfs:/path/file1
       hdfs:/path/file2
       hdfs:/path/file3
       ....
   
   `actualFileDF`:
       hdfs://host:port/path/file1
       hdfs://host:port/path/file2
       hdfs://host:port/path/file3
       ....
   
   Then no path in `actualFileDF` matches a path in `validFileDF`, so all the 
files in `actualFileDF` are treated as orphans.
   
   In this patch, we compare only the bare path (with the scheme and 
authority removed) when doing the `leftanti join`.
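
   A minimal sketch of the idea, not the code in this patch: it uses plain 
Java collections as a stand-in for the Spark `leftanti join`, and 
`java.net.URI` to drop the scheme and authority. The class name, the helper 
names `stripSchemeAndAuthority` and `orphans`, the port `8020`, and the file 
`orphan` are all hypothetical. It also assumes the locations are valid URIs 
(no unescaped spaces or other illegal characters).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PathCompareSketch {

  // Hypothetical helper: keep only the path component of a location,
  // dropping the scheme ("hdfs") and the authority ("host:8020").
  static String stripSchemeAndAuthority(String location) {
    return URI.create(location).getPath();
  }

  // Plain-Java stand-in for `actual leftanti join valid`, keyed on the
  // normalized path instead of the full location string.
  static List<String> orphans(List<String> actual, List<String> valid) {
    Set<String> validPaths = new HashSet<>();
    for (String v : valid) {
      validPaths.add(stripSchemeAndAuthority(v));
    }
    List<String> result = new ArrayList<>();
    for (String a : actual) {
      if (!validPaths.contains(stripSchemeAndAuthority(a))) {
        result.add(a);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Unqualified locations, as recorded when the warehouse is `hdfs:/path`.
    List<String> valid = Arrays.asList("hdfs:/path/file1", "hdfs:/path/file2");
    // Qualified locations, as a directory listing would return them.
    List<String> actual = Arrays.asList(
        "hdfs://host:8020/path/file1",
        "hdfs://host:8020/path/file2",
        "hdfs://host:8020/path/orphan");
    // Only the file that has no counterpart in `valid` is reported.
    System.out.println(orphans(actual, valid));
  }
}
```

   Comparing the full strings instead would report all three listed files, 
which is the behavior described above.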
   
   Updated the existing unit tests to cover this case.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


