[GitHub] [spark] viirya opened a new pull request #25215: [SPARK-28445][SQL][Python] Fix error when PythonUDF is used in both group by and aggregate expression

GitBox Sat, 20 Jul 2019 20:09:51 -0700

viirya opened a new pull request #25215: [SPARK-28445][SQL][Python] Fix error 
when PythonUDF is used in both group by and aggregate expression
URL: https://github.com/apache/spark/pull/25215
 
 
   ## What changes were proposed in this pull request?
   
   When a PythonUDF is used in the group by and also appears in an aggregate 
expression, for example:
   
   ```
   SELECT pyUDF(a + 1), COUNT(b) FROM testData GROUP BY pyUDF(a + 1)
   ```
   
   It causes an analysis exception in `CheckAnalysis`:
   ```
   org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function.
   ```
   
   There are two problems. First, `CheckAnalysis` can't check semantic equality 
between PythonUDFs. Second, even if we make that possible, a runtime exception 
is thrown:
   
   ```
   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF1#8615
   ...
   Cause: java.lang.RuntimeException: Couldn't find pythonUDF1#8615 in [cast(pythonUDF0#8614 as int)#8617,count(b#8599)#8607L]
   ```
   
   The cause is that `ExtractPythonUDFs` extracts both the PythonUDF in the 
group by and the one in the aggregate expression, and they are now two 
different aliases in the logical aggregate. At runtime, the resulting 
expression in the aggregate can't be bound to its grouping and aggregate 
attributes.
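
The binding failure can be illustrated with a small plain-Python sketch. It is not Spark code: `bind_reference` is a hypothetical stand-in for looking up an attribute among a child operator's output attributes, and the attribute strings are copied from the stack trace above.

```python
from typing import List


def bind_reference(attr: str, input_attrs: List[str]) -> int:
    """Return the ordinal of `attr` in `input_attrs`, or raise if it is absent.

    Hypothetical helper: it only mimics the lookup that fails in the trace.
    """
    if attr not in input_attrs:
        raise RuntimeError(f"Couldn't find {attr} in [{','.join(input_attrs)}]")
    return input_attrs.index(attr)


# Output attributes of the aggregate, as printed in the stack trace.
aggregate_output = ["cast(pythonUDF0#8614 as int)#8617", "count(b#8599)#8607L"]

# The result expression refers to pythonUDF1#8615, a different alias from
# pythonUDF0#8614, so the lookup fails.
try:
    bind_reference("pythonUDF1#8615", aggregate_output)
    bind_error = None
except RuntimeError as exc:
    bind_error = str(exc)
```

The second alias never appears among the attributes the aggregate produces, which is why the lookup has nothing to bind to.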
   
   This patch proposes a rule, `ExtractGroupingPythonUDFFromAggregate`, that 
extracts PythonUDFs in the group by and evaluates them before the aggregate. 
The group by PythonUDF in the aggregate expression is then replaced with the 
aliased result.
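
The idea of the rewrite can be sketched with a toy expression model in plain Python. This is only an illustration under stated assumptions, not the Catalyst implementation: the `Attr`, `Lit`, `Udf` and `Call` classes, the `grouping_udfN` alias names, and `extract_grouping_udfs` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


# Toy expression nodes (hypothetical; not Spark's Catalyst classes).
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Udf:
    name: str
    args: Tuple["Expr", ...]


@dataclass(frozen=True)
class Call:
    name: str  # e.g. "count" or "+"
    args: Tuple["Expr", ...]


Expr = Union[Attr, Lit, Udf, Call]


def contains_udf(e: Expr) -> bool:
    """True if the expression is, or contains, a Udf node."""
    if isinstance(e, Udf):
        return True
    if isinstance(e, Call):
        return any(contains_udf(a) for a in e.args)
    return False


def substitute(e: Expr, mapping: Dict[Expr, Attr]) -> Expr:
    """Replace every subtree that equals a mapped expression with its alias."""
    if e in mapping:
        return mapping[e]
    if isinstance(e, (Udf, Call)):
        return type(e)(e.name, tuple(substitute(a, mapping) for a in e.args))
    return e


def extract_grouping_udfs(grouping: List[Expr], aggregates: List[Expr]):
    """Alias grouping expressions that contain a UDF and reuse the alias.

    Returns (projection, new_grouping, new_aggregates), where `projection`
    lists the (alias, expression) pairs to evaluate before the aggregate.
    """
    mapping: Dict[Expr, Attr] = {}
    for g in grouping:
        if contains_udf(g) and g not in mapping:
            mapping[g] = Attr(f"grouping_udf{len(mapping)}")
    projection = [(alias.name, expr) for expr, alias in mapping.items()]
    new_grouping = [substitute(g, mapping) for g in grouping]
    new_aggregates = [substitute(a, mapping) for a in aggregates]
    return projection, new_grouping, new_aggregates


# SELECT pyUDF(a + 1), COUNT(b) FROM testData GROUP BY pyUDF(a + 1)
udf = Udf("pyUDF", (Call("+", (Attr("a"), Lit(1))),))
projection, grouping, aggregates = extract_grouping_udfs(
    grouping=[udf],
    aggregates=[udf, Call("count", (Attr("b"),))],
)
```

After the rewrite, the group by and the aggregate expression both refer to the same alias, so the UDF is evaluated once, below the aggregate, and there is a single attribute to bind against.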
   
   ## How was this patch tested?
   
   Added tests.


