Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22326#discussion_r220516732
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
---
@@ -152,3 +153,51 @@ object EliminateOuterJoin extends Rule[LogicalPlan]
with PredicateHelper {
if (j.joinType == newJoinType) f else Filter(condition,
j.copy(joinType = newJoinType))
}
}
+
+/**
+ * Correctly handle PythonUDF which need access both side of join side by
changing the new join
+ * type to Cross.
+ */
+object PullOutPythonUDFInJoinCondition extends Rule[LogicalPlan] with
PredicateHelper {
+ def hasPythonUDF(expression: Expression): Boolean = {
+ expression.collectFirst { case udf: PythonUDF => udf }.isDefined
--- End diff --
This doesn't matter. We can't evaluate a Python UDF in the join condition,
so we need to pull it out; that's all.
A Python UDF that accesses attributes from only one side would normally be
pushed down by other rules. If it isn't (e.g. the user disables the filter
pushdown rule), we need to pull it out here, too. Either way, that's
orthogonal to this rule.
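To make the point concrete, here is a minimal sketch of "pull it out, that's all". It uses a toy expression/plan model, not Catalyst: `Expr`, `Plan`, `PyUdf`, `PullOutSketch` and the helper names are all hypothetical, and join types are ignored entirely (the real rule would also have to decide the new join type, e.g. Cross when no condition is left). Note that it doesn't check which side the UDF reads from; any conjunct containing a UDF moves into a Filter above the join.

```scala
// Toy model only; these are NOT Spark's Catalyst classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class PyUdf(name: String, args: Seq[Expr]) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr

sealed trait Plan
case class Relation(name: String) extends Plan
case class Join(left: Plan, right: Plan, condition: Option[Expr]) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

object PullOutSketch {
  // True if the expression tree contains a (toy) Python UDF anywhere.
  def hasPyUdf(e: Expr): Boolean = e match {
    case _: PyUdf      => true
    case Attr(_)       => false
    case EqualTo(l, r) => hasPyUdf(l) || hasPyUdf(r)
    case And(l, r)     => hasPyUdf(l) || hasPyUdf(r)
  }

  // Flatten nested Ands into a list of conjuncts.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Re-combine conjuncts with And; None if there are none.
  def conjoin(es: Seq[Expr]): Option[Expr] =
    es.reduceLeftOption[Expr]((a, b) => And(a, b))

  // Move every UDF-bearing conjunct of a join condition into a Filter
  // above the join, regardless of which side(s) the UDF references.
  def pullOut(plan: Plan): Plan = plan match {
    case Join(l, r, Some(cond)) if hasPyUdf(cond) =>
      val (udfPreds, rest) = splitConjuncts(cond).partition(hasPyUdf)
      val newJoin = Join(pullOut(l), pullOut(r), conjoin(rest))
      // udfPreds is non-empty here because hasPyUdf(cond) held.
      Filter(conjoin(udfPreds).get, newJoin)
    case Join(l, r, c)    => Join(pullOut(l), pullOut(r), c)
    case Filter(c, child) => Filter(c, pullOut(child))
    case rel: Relation    => rel
  }
}
```

With a condition like `a.id = b.id AND f(a.x, b.y) = a.z`, the equality on `id` stays in the join and the UDF predicate ends up in the Filter on top.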
---