HyukjinKwon opened a new pull request #25386: [SPARK-28654][SQL] Move "Extract Python UDFs" to the last in optimizer
URL: https://github.com/apache/spark/pull/25386

## What changes were proposed in this pull request?

Plans produced by the "Extract Python UDFs" batch are fragile: optimizer rules that run afterwards can easily undo them. For instance, if we add a rule such as `PushDownPredicates` to `postHocOptimizationBatches`, the following test in `BatchEvalPythonExecSuite` fails:

```scala
test("Python UDF refers to the attributes from more than one child") {
  val df = Seq(("Hello", 4)).toDF("a", "b")
  val df2 = Seq(("Hello", 4)).toDF("c", "d")
  val joinDF = df.crossJoin(df2).where("dummyPythonUDF(a, c) == dummyPythonUDF(d, c)")
  val qualifiedPlanNodes = joinDF.queryExecution.executedPlan.collect {
    case b: BatchEvalPythonExec => b
  }
  assert(qualifiedPlanNodes.size == 1)
}
```

The test fails with:

```
Invalid PythonUDF dummyUDF(a#63, c#74), requires attributes from more than one child.
```

This is because the Python UDF extraction is rolled back by the later rule, as shown below:

```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates ===
!Filter (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))   Join Cross, (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))
!+- Join Cross                                          :- Project [_1#2 AS a#7, _2#3 AS b#8]
!   :- Project [_1#2 AS a#7, _2#3 AS b#8]               :  +- LocalRelation [_1#2, _2#3]
!   :  +- LocalRelation [_1#2, _2#3]                    +- Project [_1#13 AS c#18, _2#14 AS d#19]
!   +- Project [_1#13 AS c#18, _2#14 AS d#19]              +- LocalRelation [_1#13, _2#14]
!      +- LocalRelation [_1#13, _2#14]
```

It seems we should handle the Python UDF cases last, even after the post hoc rules.

Note that this follows the approach of previous versions, when these rules were applied to physical plans (see SPARK-24721 and SPARK-12981). Those optimization rules were supposed to be placed at the end.

Note that I intentionally didn't move `ExperimentalMethods` (`spark.experimental.extraStrategies`). This is an explicitly experimental API, and I wanted to keep it available as a just-in-case workaround after this change, for now.

## How was this patch tested?

Existing tests should cover this.
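
As an illustration of the ordering problem described in this pull request, here is a minimal, self-contained toy model. It does not use Spark's Catalyst classes: `Plan`, `Relation`, `Join`, `Filter`, `pushDownPredicates`, `pullOutUdfCondition`, and `run` are all hypothetical stand-ins. The only assumption it makes is that one rule moves a UDF predicate out of a join condition into a `Filter`, and that a push-down rule folds such a `Filter` back into the join, which is what the plan change log above shows.

```scala
// Toy plan nodes; these are NOT Spark's Catalyst classes.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Join(left: Plan, right: Plan, condition: Option[String]) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

object RuleOrderSketch {
  type Rule = Plan => Plan

  // Stand-in for predicate push-down: folds a Filter sitting on top of an
  // unconditioned Join into the join condition.
  val pushDownPredicates: Rule = {
    case Filter(cond, Join(l, r, None)) => Join(l, r, Some(cond))
    case other => other
  }

  // Stand-in for the Python UDF handling: keeps a UDF predicate out of the
  // join condition by placing it in a Filter above the join.
  val pullOutUdfCondition: Rule = {
    case Join(l, r, Some(cond)) if cond.contains("dummyUDF") =>
      Filter(cond, Join(l, r, None))
    case other => other
  }

  // Applies the rules once each, in the given order.
  def run(rules: Seq[Rule], plan: Plan): Plan =
    rules.foldLeft(plan)((p, rule) => rule(p))

  val udfCondition = "dummyUDF(a, c) = dummyUDF(d, c)"
  val start: Plan = Join(Relation("df"), Relation("df2"), Some(udfCondition))

  def main(args: Array[String]): Unit = {
    // UDF rule first, push-down afterwards: the UDF ends up back in the join.
    println(run(Seq(pullOutUdfCondition, pushDownPredicates), start))
    // Push-down first, UDF rule last: the UDF stays in the Filter.
    println(run(Seq(pushDownPredicates, pullOutUdfCondition), start))
  }
}
```

In this sketch, running the UDF rule last is the only ordering whose final plan keeps the UDF predicate in a `Filter`, which mirrors the motivation for moving the batch to the end.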
