[
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ihor Bobak updated SPARK-32347:
-------------------------------
Description:
The bug is easy to reproduce: run the same code below in Spark 2.4.3 and in
Spark 3.0.0.
With 3.0.0, Spark SQL raises a misleading error message, although the SQL
statement appears to be valid and works fine in Spark 2.4.3.
{code:python}
import pandas as pd
pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])
df_sales = spark.createDataFrame(pdf_sales)
df_buyers = spark.createDataFrame(pdf_buyers)
df_sales.createOrReplaceTempView("df_sales")
df_buyers.createOrReplaceTempView("df_buyers")
spark.sql("""
with b as (
select /*+ BROADCAST(df_buyers) */
BuyerID, BuyerName
from df_buyers
)
select
b.BuyerID,
b.BuyerName,
s.Qty
from df_sales s
inner join b on s.BuyerID = b.BuyerID
""").toPandas()
{code}
The (wrong) error message:
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-4-8dfe318a59ee> in <module>
     22 from df_sales s
     23 inner join b on s.BuyerID = b.BuyerID
---> 24 """).toPandas()

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
    644         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    645         """
--> 646         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    647
    648     @since(2.0)

/opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
    135                 # Hide where the exception came from that shows a non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
+- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
   :- SubqueryAlias s
   :  +- SubqueryAlias df_sales
   :     +- LogicalRDD [BuyerID#23L, Qty#24L], false
   +- SubqueryAlias b
      +- Project [BuyerID#27L, BuyerName#28]
         +- SubqueryAlias df_buyers
            +- LogicalRDD [BuyerID#27L, BuyerName#28], false
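
To check whether the hint is what triggers the failure, one option is to run the same statement once with the hint and once without it. The following is only a minimal sketch: `strip_hints` is a hypothetical helper written for this illustration (it is not part of PySpark), the SQL indentation is arbitrary, and the `spark` session and temp views are assumed to be the ones created in the reproduction code.

```python
import re

# Matches SQL hint comments of the form /*+ ... */ (non-greedy, may span lines).
_HINT_COMMENT = re.compile(r"/\*\+.*?\*/", re.DOTALL)


def strip_hints(sql: str) -> str:
    """Hypothetical helper: return `sql` with every /*+ ... */ hint comment removed."""
    return _HINT_COMMENT.sub("", sql)


# The query from the report, kept as a plain string so both variants can be built.
query = """
with b as (
    select /*+ BROADCAST(df_buyers) */
        BuyerID, BuyerName
    from df_buyers
)
select
    b.BuyerID,
    b.BuyerName,
    s.Qty
from df_sales s
inner join b on s.BuyerID = b.BuyerID
"""

query_without_hint = strip_hints(query)

# In a live session (assumed, not executed here), compare the two variants:
#   spark.sql(query).toPandas()
#   spark.sql(query_without_hint).toPandas()
```

If only the first call fails, the hint is the trigger, and the rest of the statement can be ruled out.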
was:
The bug is easy to reproduce: run the same code below in Spark 2.4.3 and in
Spark 3.0.0.
With 3.0.0, Spark SQL raises a misleading error message, although the SQL
statement appears to be valid and works fine in Spark 2.4.3.
{code:python}
import pandas as pd
pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])
df_sales = spark.createDataFrame(pdf_sales)
df_buyers = spark.createDataFrame(pdf_buyers)
df_sales.createOrReplaceTempView("df_sales")
df_buyers.createOrReplaceTempView("df_buyers")
spark.sql("""
with b as (
select /*+ BROADCAST(df_buyers) */
BuyerID, BuyerName
from df_buyers
)
select
b.BuyerID,
b.BuyerName,
s.Qty
from df_sales s
inner join b on s.BuyerID = b.BuyerID
""").toPandas()
{code}
> BROADCAST hint makes a weird message that "column can't be resolved" (it was
> OK in Spark 2.4)
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in
> local[4] mode, but with Standalone cluster it also fails the same way.
>
>
> Reporter: Ihor Bobak
> Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>