[jira] [Commented] (SPARK-45722) False positive when checking for ambiguous columns

Alexey Dmitriev (Jira) Mon, 30 Oct 2023 04:27:03 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-45722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780965#comment-17780965
 ]


Alexey Dmitriev commented on SPARK-45722:
-----------------------------------------

Turning off spark.sql.analyzer.failAmbiguousSelfJoin doesn't help, so the 
issue is probably not exactly where I thought it was:
{code:java}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
session = SparkSession.Builder().getOrCreate()
session.conf.set('spark.sql.analyzer.failAmbiguousSelfJoin', False)
A = session.createDataFrame([(1,)], ['a'])
B = session.createDataFrame([(1,)], ['b'])
A.join(B).select(B.b)  # works
C = A.join(A.join(B), on=F.lit(False), how='leftanti')  # C has only A's columns
C.join(B).select(B.b)  # raises the AnalysisException below
{code}
AnalysisException: Resolved attribute(s) b#2L missing from a#0L,b#12L in 
operator !Project [b#2L]. Attribute(s) with the same name appear in the 
operation: b. Please check if the right attribute(s) are used.;
!Project [b#2L]
+- Join Inner
   :- Join LeftAnti, false
   :  :- LogicalRDD [a#0L], false
   :  +- Join Inner
   :     :- LogicalRDD [a#9L], false
   :     +- LogicalRDD [b#2L], false
   +- LogicalRDD [b#12L], false
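
To make the plan above easier to read, here is a toy model in plain Python (no Spark involved) of what the attribute IDs suggest is happening: B already occurs inside the anti join as b#2L, so its second occurrence at the top level shows up with a fresh ID (b#12L), and the reference held by B.b no longer matches anything in the join output. The names Attr, add_relation and _fresh_ids are hypothetical, and the renewal rule is an assumption read off the plan, not Spark's actual analyzer code.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical stand-in for Spark's expression IDs (the "#2L" part of "b#2L").
_fresh_ids = count(100)


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


def add_relation(attrs, seen_ids):
    """Add a relation to a plan; renew its attributes if their IDs are already in the plan."""
    if any(a.expr_id in seen_ids for a in attrs):
        attrs = [Attr(a.name, next(_fresh_ids)) for a in attrs]
    seen_ids.update(a.expr_id for a in attrs)
    return attrs


A = [Attr("a", 0)]
B = [Attr("b", 2)]
b_ref = B[0]  # what `B.b` points at: the original attribute b#2

# A.join(B): neither relation occurs twice, so nothing is renewed.
seen = set()
simple_output = add_relation(A, seen) + add_relation(B, seen)

# C = A leftanti (A join B): C exposes only A's columns, but B is now part of the plan.
seen = set()
c_output = add_relation(A, seen)
add_relation(A, seen)  # inner A conflicts with the outer A and is renewed
add_relation(B, seen)  # inner B keeps b#2

# C.join(B): B occurs a second time in the plan, so it gets a fresh ID.
self_join_output = c_output + add_relation(B, seen)

# Looking the column up by name still finds exactly one candidate.
by_name = [a for a in self_join_output if a.name == "b"]

print(b_ref in simple_output)     # True
print(b_ref in self_join_output)  # False
print(len(by_name))               # 1
```

In this model the by-ID reference fails while a by-name lookup stays unambiguous, which is consistent with the error message's advice to alias the Datasets and use qualified names. Whether that sidesteps the problem in Spark itself is not verified here.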

> False positive when checking for ambiguous columns
> --------------------------------------------------
>
>                 Key: SPARK-45722
>                 URL: https://issues.apache.org/jira/browse/SPARK-45722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0
>         Environment: py3.11 
> pyspark 3.4.0
>            Reporter: Alexey Dmitriev
>            Priority: Major
>
> I have the following code, which I expect to work:
> {code:java}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> session = SparkSession.Builder().getOrCreate()
> A = session.createDataFrame([(1,)], ['a'])
> B = session.createDataFrame([(1,)], ['b'])
> A.join(B).select(B.b)  # works fine
> # C has the same columns as A (the same columns, not only the same names)
> C = A.join(A.join(B), on=F.lit(False), how='leftanti')
> C.join(B).select(B.b)  # doesn't work, says B.b is ambiguous
> {code}
> Exception below:
> {code:java}
> AnalysisException: Column b#11L are ambiguous. It's probably because you 
> joined several Datasets together, and some of these Datasets are the same. 
> This column points to one of the Datasets but Spark is unable to figure out 
> which one. Please alias the Datasets with different names via `Dataset.as` 
> before joining them, and specify the column using qualified name, e.g. 
> `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set 
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.{code}



