manishmalhotrawork edited a comment on pull request #1466:
URL: https://github.com/apache/iceberg/pull/1466#issuecomment-693752556
> Having a UDF that accepts columns from two relations does not eliminate the cross join.
>
> I guess we have two options:
>
> * keep the join by file name and replace `contains` condition with another UDF that would ignore authority
> * replace the existing UDF that produces file names with another UDF that would produce a scheme and a relative path and then use DataFrame operations. That way, we will have only one UDF.
>
> ```
> Column pathCond = actualFileDF.col("relative_path").equalTo(validDataFileDF.col("relative_path"));
> Column schemeEquality = actualFileDF.col("scheme").equalTo(validDataFileDF.col("scheme"));
> Column schemeCond = validDataFileDF.col("scheme").isNull().or(schemeEquality);
> Column joinCond = pathCond.and(schemeCond);
> ```
@aokolnychyi thanks for adding the details.
Yes, this is similar to our internal discussion and the internal PR I have.
While testing, though, I realized that checking the name of the `scheme` can
also be troublesome, because the scheme name is configurable: instead of `hdfs`
we can have `myhdfs`, and the same goes for S3.
I believe ignoring the `scheme` should also be OK, as there should not be a
case where the table location is in HDFS but the data it points to, which needs
to be removed, is in S3, or vice versa?
So we can either skip the scheme check as well, or add a flag that controls
whether the scheme is considered.
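
To make the flag idea concrete, here is a rough sketch in plain Java using `java.net.URI` instead of Spark, so it runs on its own. The class `FileLocationMatch`, the method `sameFile`, and the `checkScheme` flag are hypothetical names for illustration only, not anything in Iceberg, and the sketch assumes every location parses as a well-formed URI. The authority is always ignored, and the scheme check mirrors the quoted `schemeCond` (a missing scheme on the valid file counts as a match).

```java
import java.net.URI;
import java.util.Objects;

public class FileLocationMatch {

  // Hypothetical helper: returns true when both locations point at the same path.
  // The authority (host:port) is never compared. The scheme is compared only
  // when checkScheme is true and the valid location actually carries a scheme.
  static boolean sameFile(String actual, String valid, boolean checkScheme) {
    URI actualUri = URI.create(actual);
    URI validUri = URI.create(valid);

    boolean pathMatches = Objects.equals(actualUri.getPath(), validUri.getPath());
    if (!pathMatches || !checkScheme) {
      return pathMatches;
    }

    // Same shape as: validScheme IS NULL OR actualScheme = validScheme
    String validScheme = validUri.getScheme();
    return validScheme == null || validScheme.equals(actualUri.getScheme());
  }

  public static void main(String[] args) {
    String actual = "hdfs://nn1:8020/warehouse/t/data/f.parquet";
    String renamed = "myhdfs://nn1:8020/warehouse/t/data/f.parquet";
    String noScheme = "/warehouse/t/data/f.parquet";

    System.out.println(sameFile(actual, renamed, false));  // scheme ignored
    System.out.println(sameFile(actual, renamed, true));   // scheme names differ
    System.out.println(sameFile(actual, noScheme, true));  // valid file has no scheme
  }
}
```

With the flag off, a configurable scheme name such as `myhdfs` no longer causes a mismatch; with it on, the behaviour stays the same as the quoted join condition.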