viirya opened a new pull request #25204: [SPARK-28441][SQL][Python] Fix error when PythonUDF is used in correlated scalar subquery
URL: https://github.com/apache/spark/pull/25204

## What changes were proposed in this pull request?

In SPARK-15370, we started checking the expression at the root of a correlated scalar subquery in order to fix the count bug. If a `PythonUDF` is in the checking path, evaluating it fails because we can't statically evaluate a `PythonUDF`. The Python UDF test added in SPARK-28277 shows this issue.

With this change, if we can statically evaluate the expression, we intercept NULL values coming from the outer join and replace them with the value that the subquery's expression would return for zero matching rows, as before. If we can't, we replace them with the `PythonUDF` expression, with statically evaluated parameters.

After this, the last query in `udf-except.sql`, which previously threw `java.lang.UnsupportedOperationException`, can be run:

```
SELECT t1.k
FROM   t1
WHERE  t1.v <= (SELECT udf(max(udf(t2.v)))
                FROM   t2
                WHERE  udf(t2.k) = udf(t1.k))
MINUS
SELECT t1.k
FROM   t1
WHERE  udf(t1.v) >= (SELECT min(udf(t2.v))
                     FROM   t2
                     WHERE  t2.k = t1.k)
-- !query 2 schema
struct<k:string>
-- !query 2 output
two
```

## How was this patch tested?

Added tests.
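
To illustrate the NULL-replacement rule described above, here is a minimal pure-Python sketch. It is a toy model, not Catalyst code: the classes `Lit`, `Agg`, `Add`, `PyUDF` and the helpers `eval_on_zero_rows`, `null_replacement`, and `evaluate` are hypothetical names invented for this example, and the real optimizer rule may differ in detail. The idea shown is that an expression without a UDF is folded to a literal by evaluating it on zero input rows, while a UDF is kept as an expression and only its parameters are folded.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class Agg:
    name: str  # e.g. "count", "max", "min"


@dataclass(frozen=True)
class Add:
    left: Any
    right: Any


@dataclass(frozen=True)
class PyUDF:
    func: Callable[[Any], Any]  # stands in for a Python UDF (assumption)
    arg: Any


def contains_udf(expr):
    """True if a PyUDF appears anywhere in the expression tree."""
    if isinstance(expr, PyUDF):
        return True
    if isinstance(expr, Add):
        return contains_udf(expr.left) or contains_udf(expr.right)
    return False


def eval_on_zero_rows(expr):
    """Statically evaluate expr as if the subquery matched no rows."""
    if isinstance(expr, Lit):
        return expr.value
    if isinstance(expr, Agg):
        # count over zero rows is 0; other aggregates yield NULL (None).
        return 0 if expr.name == "count" else None
    if isinstance(expr, Add):
        left, right = eval_on_zero_rows(expr.left), eval_on_zero_rows(expr.right)
        return None if left is None or right is None else left + right
    # A UDF cannot be evaluated statically in this model.
    raise NotImplementedError("cannot statically evaluate " + type(expr).__name__)


def null_replacement(expr):
    """Expression that replaces NULLs produced by the outer join."""
    if not contains_udf(expr):
        return Lit(eval_on_zero_rows(expr))
    if isinstance(expr, PyUDF):
        # Keep the UDF, fold only its parameter.
        return PyUDF(expr.func, null_replacement(expr.arg))
    # Only Add can reach here: fold each side independently.
    return Add(null_replacement(expr.left), null_replacement(expr.right))


def evaluate(expr):
    """Run-time evaluation of a replacement expression, UDF calls included."""
    if isinstance(expr, PyUDF):
        return expr.func(evaluate(expr.arg))
    if isinstance(expr, Add):
        left, right = evaluate(expr.left), evaluate(expr.right)
        return None if left is None or right is None else left + right
    return eval_on_zero_rows(expr)


def times_ten(x):
    return None if x is None else x * 10


# udf(count(...) + 1): the argument folds to 1, the UDF call is kept.
expr = PyUDF(times_ten, Add(Agg("count"), Lit(1)))
replacement = null_replacement(expr)
print(evaluate(replacement))  # prints 10
```

In this model, calling `eval_on_zero_rows(expr)` directly on the UDF expression raises, which mirrors the failure the description mentions, whereas `null_replacement(expr)` defers the UDF call to run time.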
