[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Kimmel updated SPARK-38983:
---------------------------------
    Description: 
h1. In a nutshell

Pyspark emits an incorrect error message when the user commits a type error with the result of the {{grouping()}} function.
h1. Code to reproduce

{{print(spark.version)  # My environment, Azure Databricks, defines spark automatically.}}
{{from pyspark.sql import functions as f}}
{{from pyspark.sql import types as t}}
{{l = [}}
{{  ('a',),}}
{{  ('b',),}}
{{]}}
{{s = t.StructType([}}
{{  t.StructField('col1', t.StringType())}}
{{])}}
{{df = spark.createDataFrame(l, s)}}
{{df.display()}}
{{(  # This expression raises an AnalysisException()}}
{{  df}}
{{  .cube(f.col('col1'))}}
{{  .agg(f.grouping('col1') & f.lit(True))}}
{{  .collect()}}
{{)}}
h1. Expected results

The code should produce an {{AnalysisException()}} with an error message along the lines of:
{{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
h1. Actual results

The code throws an {{AnalysisException()}} with the error message:
{{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}

Python provides the following traceback:
{{---------------------------------------------------------------------------}}
{{AnalysisException                         Traceback (most recent call last)}}
{{<command-2283735107422632> in <module>}}
{{     15 }}
{{     16 ( # This expression raises an AnalysisException()}}
{{---> 17   df}}
{{     18   .cube(f.col('col1'))}}
{{     19   .agg(f.grouping('col1') & f.lit(True))}}

{{/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)}}
{{    116             # Columns}}
{{    117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"}}
{{--> 118             jdf = self._jgd.agg(exprs[0]._jc,}}
{{    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))}}
{{    120         return DataFrame(jdf, self.sql_ctx)}}

{{/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)}}
{{   1302 }}
{{   1303         answer = self.gateway_client.send_command(command)}}
{{-> 1304         return_value = get_return_value(}}
{{   1305             answer, self.gateway_client, self.target_id, self.name)}}
{{   1306 }}

{{/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
{{    121                 # Hide where the exception came from that shows a non-Pythonic}}
{{    122                 # JVM exception message.}}
{{--> 123                 raise converted from None}}
{{    124             else:}}
{{    125                 raise}}

{{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
{{'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]}}
{{+- LogicalRDD [col1#548], false}}
h1. Workaround

_Note:_ I opened this ticket because, when the user makes a particular type error, the resulting error message is misleading. The code snippet below shows how to fix that type error. It does not address the misleading-error-message bug, which is the focus of this ticket.

Cast the result of {{.grouping()}} to boolean type. That is, bear in mind from the start that {{.grouping()}} produces an integer 0 or 1 rather than a boolean True or False.

{{(  # This expression does not raise an AnalysisException()}}
{{  df}}
{{  .cube(f.col('col1'))}}
{{  .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))}}
{{  .collect()}}
{{)}}
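The failure mode is easy to see outside Spark. The expected message is about types: a boolean AND requires boolean operands, while {{grouping()}} yields an int. A minimal pure-Python sketch of that kind of type check (illustrative only; the function name and messages are hypothetical, not Spark APIs):

```python
# Illustrative sketch: mimics the *kind* of type check Spark's analyzer
# performs, to show why the expected message mentions "int and boolean".
def and_result_type(left: str, right: str) -> str:
    """Boolean AND is only defined for boolean operands."""
    if left != "boolean" or right != "boolean":
        raise TypeError(
            f"cannot resolve AND due to data type mismatch ({left} and {right})"
        )
    return "boolean"

# grouping(col) has integer type, lit(True) is boolean -> mismatch:
try:
    and_result_type("int", "boolean")
except TypeError as exc:
    print(exc)

# After casting grouping(col) to boolean, the AND type-checks:
print(and_result_type("boolean", "boolean"))
```

This is the check whose error text the "Expected results" section anticipates; the bug is that Spark's grouping-validation rule fires first and masks it.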
h1. Additional notes

The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code to 
reproduce".

The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} in 
"Code to reproduce".
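That {{.grouping_id()}} hits the same path is unsurprising: per the PySpark docs it also returns an integer, namely the bitmask {{(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)}}. A pure-Python sketch of how that bitmask is composed (illustrative, not Spark code):

```python
def grouping_id(bits):
    """Compose per-column grouping bits (1 = column aggregated away)
    into the integer bitmask that GROUPING_ID() returns."""
    gid = 0
    for b in bits:
        gid = (gid << 1) | b  # shift earlier columns left, append this bit
    return gid

# Two grouping columns: first rolled up, second kept -> bits (1, 0):
print(grouping_id([1, 0]))  # 0b10 == 2
```

Since the result is an int (never a boolean), combining it directly with {{f.lit(True)}} commits the same type error as {{.grouping()}}.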
h1. Related tickets

https://issues.apache.org/jira/browse/SPARK-22748
h1. Relevant documentation
 * [Spark SQL GROUPBY, ROLLUP, and CUBE 
semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
 * 
[DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
 * 
[DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
 * 
[DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
 * 
[functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
 * 
[functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]

 


> Pyspark throws AnalysisException with incorrect error message when using 
> .grouping() or .groupingId() (AnalysisException: grouping() can only be used 
> with GroupingSets/Cube/Rollup;)
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38983
>                 URL: https://issues.apache.org/jira/browse/SPARK-38983
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2, 3.2.1
>         Environment: I have reproduced this error in two environments. I 
> would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster, 
> which runs Spark version 3.1.2. I have limited access to cluster 
> configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the Pyspark shell from 
> Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to 
> environment information here. Running {{spark-submit --version}} produced the 
> following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
>            Reporter: Chris Kimmel
>            Priority: Minor
>              Labels: cube, error_message_improvement, exception-handling, 
> grouping, rollup
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
