aokolnychyi edited a comment on pull request #1471:
URL: https://github.com/apache/iceberg/pull/1471#issuecomment-694612728


   I think this logic can be simplified a bit. My first thought was to leverage 
the following code:
   
   ```
     private static final UserDefinedFunction relativePathUDF =
         functions.udf((String location) -> {
       Path fullyQualifiedPath = new Path(location);
       return fullyQualifiedPath.toUri().getPath();
     }, DataTypes.StringType);
   
     ...
   
     Dataset<Row> validFileDF =
         withRelativePathColumn(validDataFileDF.union(validMetadataFileDF));
     Dataset<Row> actualFileDF = withRelativePathColumn(buildActualFileDF());
   
     ...
   
     Column joinCond =
         actualFileDF.col("relative_path").equalTo(validFileDF.col("relative_path"));
     return actualFileDF.join(validFileDF, joinCond, "leftanti")
         .select("file_path")
         .as(Encoders.STRING())
         .collectAsList();
   ```
   
   We wouldn't need the filename UDF, and the logic would be very
straightforward. Unfortunately, that does not seem to work for certain
locations. For example, `path.toUri().getPath()` for
`hdfs://user/location/sublocation/filename.parquet` returns
`/location/sublocation/filename.parquet`, because `user` is parsed as the URI
authority rather than as part of the path.
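
To make the authority problem concrete, here is a minimal sketch that uses plain `java.net.URI` instead of Hadoop's `Path` so it runs without extra dependencies. The class and helper names (`UriAuthorityDemo`, `pathOf`, `authorityOf`) are hypothetical, and it assumes `Path.toUri()` splits this location the same way, which matches the behavior described above.

```java
import java.net.URI;

public class UriAuthorityDemo {
    // Path component of a location, as java.net.URI parses it.
    static String pathOf(String location) {
        return URI.create(location).getPath();
    }

    // Authority component of a location, or null when the URI has none.
    static String authorityOf(String location) {
        return URI.create(location).getAuthority();
    }

    public static void main(String[] args) {
        // Two slashes: the first segment after "//" becomes the authority.
        String twoSlashes = "hdfs://user/location/sublocation/filename.parquet";
        // Three slashes: the authority is empty, so "user" stays in the path.
        String threeSlashes = "hdfs:///user/location/sublocation/filename.parquet";

        System.out.println(authorityOf(twoSlashes) + " | " + pathOf(twoSlashes));
        System.out.println(authorityOf(threeSlashes) + " | " + pathOf(threeSlashes));
    }
}
```

The two spellings refer to the same directory layout but yield different paths, so a plain equality join on the path component would treat them as different files.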
   
   Even if we have to keep `contains` and equality of file names, I think we 
can still leverage a single UDF:
   
   ```
     private static final UserDefinedFunction fileDetailUDF =
         functions.udf((String location) -> {
       Path fullyQualifiedPath = new Path(location);
       String fileName = fullyQualifiedPath.getName();
       String relativePath = fullyQualifiedPath.toUri().getPath();
       return RowFactory.create(fileName, relativePath);
     }, FILE_DETAIL_STRUCT);
   ```
   
   Then our join condition would be `==` on file names and `contains` on
relative paths.
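
As a rough illustration of that condition outside Spark, the sketch below expresses it as a plain Java predicate. Everything here is hypothetical (`FileMatchSketch`, `sameFile`, and the helpers are not from the PR), it uses `java.net.URI` in place of Hadoop's `Path`, and it assumes hierarchical locations. Since the comment does not say which side `contains` is applied to, the sketch checks both directions; that is an assumption, not the PR's behavior.

```java
import java.net.URI;

public class FileMatchSketch {
    // Path component of a hierarchical location (assumed non-null here).
    static String relativePath(String location) {
        return URI.create(location).getPath();
    }

    // Last path segment, standing in for Hadoop's Path.getName().
    static String fileName(String location) {
        String path = relativePath(location);
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Hypothetical match rule: equal file names, and one relative path
    // contains the other (direction is an assumption, so both are tried).
    static boolean sameFile(String actualLocation, String validLocation) {
        String actual = relativePath(actualLocation);
        String valid = relativePath(validLocation);
        return fileName(actualLocation).equals(fileName(validLocation))
            && (actual.contains(valid) || valid.contains(actual));
    }

    public static void main(String[] args) {
        System.out.println(sameFile(
            "hdfs://user/location/sublocation/filename.parquet",
            "hdfs:///user/location/sublocation/filename.parquet"));
    }
}
```

The file-name equality keeps the match cheap and selective, while `contains` tolerates the prefix that is lost when the first segment is parsed as the authority.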

