[
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ihor Bobak updated SPARK-32347:
-------------------------------
Description:
The bug is easy to reproduce: run the same code below in Spark 2.4.3 and in
Spark 3.0.0.
In 3.0.0, Spark SQL raises a misleading error (a column "can't be resolved"),
although the SQL statement appears to be valid and runs fine in Spark 2.4.3.
{code:python}
import pandas as pd

pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])

df_sales = spark.createDataFrame(pdf_sales)
df_buyers = spark.createDataFrame(pdf_buyers)

df_sales.createOrReplaceTempView("df_sales")
df_buyers.createOrReplaceTempView("df_buyers")

spark.sql("""
    with b as (
        select /*+ BROADCAST(df_buyers) */
            BuyerID, BuyerName
        from df_buyers
    )
    select
        b.BuyerID,
        b.BuyerName,
        s.Qty
    from df_sales s
    inner join b on s.BuyerID = b.BuyerID
""").toPandas()
{code}
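For reference, the expected result of the statement can be checked outside Spark. The sketch below is only an illustration, not part of the original reproduction: it swaps in Python's built-in sqlite3 module as a stand-in engine (an assumption made here for the sake of a self-contained check), creates the same two tables, and runs the same statement. SQLite treats the BROADCAST hint as an ordinary comment, so this only shows that the statement resolves its columns and which rows it should return. It says nothing about how Spark handles the hint.

```python
import sqlite3

# In-memory SQLite database used as a stand-in engine (not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_sales (BuyerID INTEGER, Qty INTEGER)")
conn.execute("CREATE TABLE df_buyers (BuyerID INTEGER, BuyerName TEXT)")
conn.executemany("INSERT INTO df_sales VALUES (?, ?)", [(1, 10), (2, 20)])
conn.executemany("INSERT INTO df_buyers VALUES (?, ?)", [(1, "John"), (2, "Jack")])

# Same statement as in the report; SQLite ignores the /*+ ... */ hint as a comment.
QUERY = """
with b as (
    select /*+ BROADCAST(df_buyers) */
        BuyerID, BuyerName
    from df_buyers
)
select
    b.BuyerID,
    b.BuyerName,
    s.Qty
from df_sales s
inner join b on s.BuyerID = b.BuyerID
"""

# Sort in Python because the statement has no ORDER BY clause.
rows = sorted(conn.execute(QUERY).fetchall())
print(rows)
```

Each sale is matched with its buyer, so two rows come back, one per BuyerID.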
was:
The bug is very easily reproduced: run the following same code in Spark 2.4.3.
and in 3.0.0.
The SQL parser will raise an invalid error message, although everything seems
to be OK with the SQL statement.
{code:python}
import pandas as pd
pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID",
"BuyerName"])
df_sales = spark.createDataFrame(pdf_sales)
df_buyers = spark.createDataFrame(pdf_buyers)
df_sales.createOrReplaceTempView("df_sales")
df_buyers.createOrReplaceTempView("df_buyers")
spark.sql("""
with b as (
select /*+ BROADCAST(df_buyers) */
BuyerID, BuyerName
from df_buyers
)
select
b.BuyerID,
b.BuyerName,
s.Qty
from df_sales s
inner join b on s.BuyerID = b.BuyerID
""").toPandas()
{code}
> BROADCAST hint makes a weird message that "column can't be resolved" (it was
> OK in Spark 2.4)
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0.0, Jupyter notebook, Spark launched in
> local[4] mode; it also fails the same way on a Standalone cluster.
>
>
> Reporter: Ihor Bobak
> Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>