peter-toth commented on a change in pull request #30203:
URL: https://github.com/apache/spark/pull/30203#discussion_r516528651



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
##########
@@ -218,13 +218,22 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with 
PredicateHelper {
     }
   }
 
+  private def canonicalizeDeterministic(u: PythonUDF) = {

Review comment:
       I see your point but if we can't rely on the deterministic flag then we 
have serious issues in other rules as well. Please consider the example in the 
description. I think column `d` should be equal to `c` no matter if the UDF is 
deterministic or not. The `CollapseProject` rule duplicates the UDF call 
(because it is flagged as deterministic) and that's why we have 2 instances 
when `ExtractPythonUDFs` starts to run.
   
   The following happens without this PR:
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
   !Project [a#7, b#8, c#12, c#12 AS d#16]         Project [_1#2 AS a#7, _2#3 
AS b#8, dummyUDF(_1#2) AS c#12, dummyUDF(_1#2) AS d#16]
   !+- Project [a#7, b#8, dummyUDF(a#7) AS c#12]   +- LocalRelation [_1#2, _2#3]
   !   +- Project [_1#2 AS a#7, _2#3 AS b#8]       
   !      +- LocalRelation [_1#2, _2#3] 
   
   ...
   
   === Applying Rule org.apache.spark.sql.execution.python.ExtractPythonUDFs ===
   !Project [_1#2 AS a#7, _2#3 AS b#8, dummyUDF(_1#2) AS c#12, dummyUDF(_1#2) 
AS d#16]   Project [_1#2 AS a#7, _2#3 AS b#8, pythonUDF1#22 AS c#12, 
pythonUDF1#22 AS d#16]
   !+- LocalRelation [_1#2, _2#3]                                               
         +- BatchEvalPython [dummyUDF(_1#2), dummyUDF(_1#2)], [pythonUDF0#21, 
pythonUDF1#22]
   !                                                                            
            +- LocalRelation [_1#2, _2#3]
   ```
   which can mean different values for `c` and `d` if the UDF is 
non-deterministic but is flagged deterministic.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to