[ 
https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45509:
---------------------------------
    Description: 
SPARK-45220 uncovered a behavior difference in a self-join scenario between 
classic Spark and Spark Connect.

For instance, here is a query that works in classic Spark (without Spark Connect):
{code:python}
from pyspark.sql import Row
import pyspark.sql.functions as sf

df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)])

joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
joined.show(){code}
But in Spark Connect, it throws this exception:
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query fails in classic Spark:
{code:python}
df.join(df, df.name == df.name, "outer").select(df.name).show(){code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

But the same query works in Spark Connect.

We need to investigate the behavior difference and fix it.

> Investigate the behavior difference in self-join
> ------------------------------------------------
>
>                 Key: SPARK-45509
>                 URL: https://issues.apache.org/jira/browse/SPARK-45509
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Allison Wang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
