manishmalhotrawork commented on pull request #1471:
URL: https://github.com/apache/iceberg/pull/1471#issuecomment-694679340
> I think this logic can be simplified a bit. My first thought was to leverage the following code:
>
> ```
> private static final UserDefinedFunction relativePathUDF = functions.udf((String location) -> {
>   Path fullyQualifiedPath = new Path(location);
>   return fullyQualifiedPath.toUri().getPath();
> }, DataTypes.StringType);
>
> ...
>
> Dataset<Row> validFileDF = withRelativePathColumn(validDataFileDF.union(validMetadataFileDF));
> Dataset<Row> actualFileDF = withRelativePathColumn(buildActualFileDF());
>
> ...
>
> Column joinCond = actualFileDF.col("relative_path").equalTo(validFileDF.col("relative_path"));
> return actualFileDF.join(validFileDF, joinCond, "leftanti").select("file_path")
>     .as(Encoders.STRING())
>     .collectAsList();
> ```
>
> We wouldn't need the filename UDF and it would be very straightforward. Unfortunately, that does not seem to work for certain locations. For example, `path.toUri().getPath()` for `hdfs://user/location/sublocation/filename.parquet` will return `/location/sublocation/filename.parquet`, because `user` is treated as the authority.
>
> Even if we have to keep `contains` and equality of file names, I think we can still leverage a single UDF:
>
> ```
> private static final UserDefinedFunction fileDetailUDF = functions.udf((String location) -> {
>   Path fullyQualifiedPath = new Path(location);
>   String fileName = fullyQualifiedPath.getName();
>   String relativePath = fullyQualifiedPath.toUri().getPath();
>   return RowFactory.create(fileName, relativePath);
> }, FILE_DETAIL_STRUCT);
> ```
>
> Then our join condition will be == on file names and contains on relative locations.

Thanks @aokolnychyi!
We can keep one UDF for both the file name and the relative path.
I think we would also need to keep `fullyQualifiedPath`, though, because it is the path of the orphan file that has to be deleted.
For the join condition, these two columns should be good.
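
To make the idea concrete, here is a minimal sketch in plain Java, outside Spark. It is only an illustration under stated assumptions: `OrphanFileMatchSketch`, `FileDetail`, `detail`, and `matches` are hypothetical names, `java.net.URI` stands in for Hadoop's `Path`, and the direction of the `contains` check (listed path contains the valid path) is my guess at what the join would need.

```java
import java.net.URI;

public class OrphanFileMatchSketch {

  // Hypothetical holder for the three values: the two join columns plus the
  // original location, which is what would be reported for deletion.
  static final class FileDetail {
    final String fileName;
    final String relativePath;
    final String fullyQualifiedPath;

    FileDetail(String fileName, String relativePath, String fullyQualifiedPath) {
      this.fileName = fileName;
      this.relativePath = relativePath;
      this.fullyQualifiedPath = fullyQualifiedPath;
    }
  }

  // Stand-in for the UDF body. Assumption: java.net.URI is used here instead
  // of org.apache.hadoop.fs.Path so that the sketch has no dependencies.
  static FileDetail detail(String location) {
    String relativePath = URI.create(location).getPath();
    String fileName = relativePath.substring(relativePath.lastIndexOf('/') + 1);
    return new FileDetail(fileName, relativePath, location);
  }

  // Join condition: == on file names and contains on relative locations.
  // Assumption: the listed (actual) path contains the valid path.
  static boolean matches(FileDetail actual, FileDetail valid) {
    return actual.fileName.equals(valid.fileName)
        && actual.relativePath.contains(valid.relativePath);
  }

  public static void main(String[] args) {
    // "user" is parsed as the URI authority, so it is missing from the path.
    FileDetail valid = detail("hdfs://user/location/sublocation/filename.parquet");
    // The same file as it might come back from a listing without a scheme.
    FileDetail actual = detail("/user/location/sublocation/filename.parquet");

    System.out.println(valid.relativePath);
    System.out.println(actual.relativePath);
    System.out.println(matches(actual, valid));
    // The original location is still available for the delete step.
    System.out.println(actual.fullyQualifiedPath);
  }
}
```

A plain equality join on `relativePath` would report this file as an orphan, while file-name equality plus `contains` treats it as valid. The third field is carried along only so the final result can return the full location.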