[
https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Kimmel updated SPARK-38983:
---------------------------------
Description:
h1. In a nutshell
PySpark emits a misleading error message when the user makes a type error with
the result of the {{grouping()}} function.
h1. Code to reproduce
{{print(spark.version) # My environment, Azure Databricks, defines spark automatically.}}
{{from pyspark.sql import functions as f}}
{{from pyspark.sql import types as t}}
{{l = [}}
{{ ('a',),}}
{{ ('b',),}}
{{]}}
{{s = t.StructType([}}
{{ t.StructField('col1', t.StringType())}}
{{])}}
{{df = spark.createDataFrame(l, s)}}
{{df.display()}}
{{( # This expression raises an AnalysisException()}}
{{ df}}
{{ .cube(f.col('col1'))}}
{{ .agg(f.grouping('col1') & f.lit(True))}}
{{ .collect()}}
{{)}}
h1. Expected results
The code should raise an {{AnalysisException()}} with an error message along the
lines of:
{{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data
type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and
boolean).;}}
h1. Actual results
The code instead raises an {{AnalysisException()}} with the error message:
{{AnalysisException: grouping() can only be used with
GroupingSets/Cube/Rollup;}}
Python provides the following traceback:
{{---------------------------------------------------------------------------}}
{{AnalysisException Traceback (most recent call last)}}
{{<command-2283735107422632> in <module>}}
{{ 15 }}
{{ 16 ( # This expression raises an AnalysisException()}}
{{---> 17 df}}
{{ 18 .cube(f.col('col1'))}}
{{ 19 .agg(f.grouping('col1') & f.lit(True))}}
{{/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)}}
{{ 116 # Columns}}
{{ 117 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"}}
{{--> 118 jdf = self._jgd.agg(exprs[0]._jc,}}
{{ 119 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))}}
{{ 120 return DataFrame(jdf, self.sql_ctx)}}
{{/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)}}
{{ 1302 }}
{{ 1303 answer = self.gateway_client.send_command(command)}}
{{-> 1304 return_value = get_return_value(}}
{{ 1305 answer, self.gateway_client, self.target_id, self.name)}}
{{ 1306 }}
{{/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
{{ 121 # Hide where the exception came from that shows a non-Pythonic}}
{{ 122 # JVM exception message.}}
{{--> 123 raise converted from None}}
{{ 124 else:}}
{{ 125 raise}}
{{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
{{'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]}}
{{+- LogicalRDD [col1#548], false}}
h1. Workaround
_Note:_ I opened this ticket because the error message that results from a
particular type error is misleading. The code snippet below shows how to fix
that type error. It does not address the misleading-error-message bug, which is
the focus of this ticket.
Cast the result of {{.grouping()}} to boolean type. The key is to know in
advance that {{.grouping()}} produces an integer 0 or 1 rather than a boolean
True or False.
{{( # This expression does not raise an AnalysisException()}}
{{ df}}
{{ .cube(f.col('col1'))}}
{{ .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))}}
{{ .collect()}}
{{)}}
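To make the integer-versus-boolean point concrete, here is a minimal pure-Python sketch. It is not Spark code and says nothing about Spark's internals. The helper name {{cube_single_column}} is hypothetical; it only imitates the shape of a one-column cube result, pairing each group key with a grouping flag that is the integer 1 when the column was aggregated away and 0 otherwise.

```python
def cube_single_column(values):
    """Toy CUBE over one column: one row per distinct value plus a grand-total row.

    Each result is (group_key, grouping_flag). The flag is the integer 1 when
    the column was aggregated away in that row, and the integer 0 otherwise.
    """
    detail_rows = [(value, 0) for value in sorted(set(values))]
    total_row = [(None, 1)]
    return detail_rows + total_row


rows = cube_single_column(["a", "b"])

# The flags are ints, not booleans, so they are converted explicitly here.
# This mirrors the .cast(t.BooleanType()) step in the workaround.
as_booleans = [(key, bool(flag)) for key, flag in rows]

print(rows)         # [('a', 0), ('b', 0), (None, 1)]
print(as_booleans)  # [('a', False), ('b', False), (None, True)]
```

Unlike Spark SQL, plain Python would quietly accept {{0 & True}}, so the explicit {{bool()}} conversion is the part of this sketch that corresponds to the cast.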
h1. Additional notes
The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code to
reproduce".
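As background for the {{.rollup()}} observation, the following pure-Python sketch lists the grouping sets that CUBE and ROLLUP generate under standard SQL semantics. The function names are hypothetical and the code does not call Spark. With a single grouping column the two lists are identical, which is consistent with both calls behaving the same way in this report.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """All subsets of the columns, largest first (CUBE semantics)."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def rollup_grouping_sets(columns):
    """Every prefix of the column list, longest first (ROLLUP semantics)."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


print(cube_grouping_sets(["col1"]))      # [('col1',), ()]
print(rollup_grouping_sets(["col1"]))    # [('col1',), ()]
print(cube_grouping_sets(["a", "b"]))    # [('a', 'b'), ('a',), ('b',), ()]
print(rollup_grouping_sets(["a", "b"]))  # [('a', 'b'), ('a',), ()]
```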
The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} in
"Code to reproduce".
h1. Related tickets
https://issues.apache.org/jira/browse/SPARK-22748
h1. Relevant documentation
* [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
* [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
* [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
* [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
* [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
* [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]
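The {{grouping_id()}} documentation linked above defines its result as {{(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)}}. The pure-Python sketch below works through that formula. The function name is hypothetical and the code does not call Spark. The result is again an integer, which is consistent with {{.grouping_id()}} triggering the same type error as {{.grouping()}}.

```python
def grouping_id_from_flags(flags):
    """Pack per-column grouping flags (each 0 or 1) into one integer.

    The first column ends up in the most significant bit, matching
    (grouping(c1) << (n-1)) + ... + grouping(cn).
    """
    result = 0
    for flag in flags:
        result = (result << 1) | flag
    return result


# Two grouping columns give four possible grouping levels.
for flags in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(flags, grouping_id_from_flags(flags))
# (0, 0) 0
# (0, 1) 1
# (1, 0) 2
# (1, 1) 3
```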
> Pyspark throws AnalysisException with incorrect error message when using
> .grouping() or .groupingId() (AnalysisException: grouping() can only be used
> with GroupingSets/Cube/Rollup;)
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-38983
> URL: https://issues.apache.org/jira/browse/SPARK-38983
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.2, 3.2.1
> Environment: I have reproduced this error in two environments. I
> would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster,
> which runs Spark version 3.1.2. I have limited access to cluster
> configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the Pyspark shell from
> Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to
> environment information here. Running {{spark-submit --version}} produced the
> following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
> Reporter: Chris Kimmel
> Priority: Minor
> Labels: cube, error_message_improvement, exception-handling,
> grouping, rollup
>
>