[jira] [Updated] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

Ihor Bobak (Jira) Fri, 17 Jul 2020 08:16:15 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ihor Bobak updated SPARK-32347:
-------------------------------
    Description: 
The bug is easy to reproduce: run the code below in Spark 2.4.3 and in Spark 
3.0.0.

Spark 3.0.0 raises a misleading AnalysisException, although the SQL statement 
appears to be valid and runs fine in Spark 2.4.3.
{code:python}
import pandas as pd

pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])

df_sales = spark.createDataFrame(pdf_sales)
df_buyers = spark.createDataFrame(pdf_buyers)

df_sales.createOrReplaceTempView("df_sales")
df_buyers.createOrReplaceTempView("df_buyers")

spark.sql("""
    with b as (
        select /*+ BROADCAST(df_buyers) */
            BuyerID, BuyerName 
        from df_buyers
    )
    select 
        b.BuyerID,
        b.BuyerName,
        s.Qty
    from df_sales s
        inner join b on s.BuyerID =  b.BuyerID
""").toPandas()
{code}

The error message (wrong, because `s.BuyerID` is listed among the input columns):
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-4-8dfe318a59ee> in <module>
     22     from df_sales s
     23         inner join b on s.BuyerID =  b.BuyerID
---> 24 """).toPandas()

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
    644         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    645         """
--> 646         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    647 
    648     @since(2.0)

/opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
    135                 # Hide where the exception came from that shows a non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
+- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
   :- SubqueryAlias s
   :  +- SubqueryAlias df_sales
   :     +- LogicalRDD [BuyerID#23L, Qty#24L], false
   +- SubqueryAlias b
      +- Project [BuyerID#27L, BuyerName#28]
         +- SubqueryAlias df_buyers
            +- LogicalRDD [BuyerID#27L, BuyerName#28], false
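
To check whether the hint is really what triggers the failure, the same statement can be run once as written and once with the hint comment removed. The following is only a rough sketch: `strip_hints` and `compare_with_and_without_hints` are hypothetical helper names (not part of PySpark), `spark` is assumed to be an active SparkSession, and the regular expression is naive (it would also strip hint-like text inside string literals).

```python
import re

# Matches SQL hint comments of the form /*+ ... */ plus any trailing whitespace.
_HINT_PATTERN = re.compile(r"/\*\+.*?\*/\s*", re.DOTALL)


def strip_hints(sql):
    """Return `sql` with every /*+ ... */ hint comment removed."""
    return _HINT_PATTERN.sub("", sql)


def compare_with_and_without_hints(spark, sql):
    """Run `sql` as given and with hints stripped.

    Returns a dict mapping the variant name to None when spark.sql()
    succeeded, or to the error text when it raised.
    """
    outcomes = {}
    for name, text in (("with_hint", sql), ("without_hint", strip_hints(sql))):
        try:
            # The traceback above shows the exception is raised by
            # spark.sql() itself, so no action such as toPandas() is needed.
            spark.sql(text)
            outcomes[name] = None
        except Exception as exc:  # an AnalysisException in the report above
            outcomes[name] = str(exc)
    return outcomes
```

If only the `with_hint` entry carries an error on 3.0.0 while both entries are None on 2.4.3, that would point at hint handling rather than at the join condition itself.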




> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32347
>                 URL: https://issues.apache.org/jira/browse/SPARK-32347
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: Spark 3.0.0, Jupyter notebook; Spark launched in 
> local[4] mode, but it fails the same way on a Standalone cluster.
>  
>  
>            Reporter: Ihor Bobak
>            Priority: Major
>             Fix For: 3.0.1
>
>         Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>


