[ https://issues.apache.org/jira/browse/SPARK-45722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780965#comment-17780965 ]
Alexey Dmitriev commented on SPARK-45722:
-----------------------------------------

Turning off spark.sql.analyzer.failAmbiguousSelfJoin doesn't help, so the issue is probably not exactly where I thought it was:
{code:java}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

session = SparkSession.Builder().getOrCreate()
session.conf.set('spark.sql.analyzer.failAmbiguousSelfJoin', False)

A = session.createDataFrame([(1,)], ['a'])
B = session.createDataFrame([(1,)], ['b'])
A.join(B).select(B.b)
C = A.join(A.join(B), on=F.lit(False), how='leftanti')
C.join(B).select(B.b)
{code}
AnalysisException: Resolved attribute(s) b#2L missing from a#0L,b#12L in operator !Project [b#2L]. Attribute(s) with the same name appear in the operation: b. Please check if the right attribute(s) are used.;
!Project [b#2L]
+- Join Inner
   :- Join LeftAnti, false
   :  :- LogicalRDD [a#0L], false
   :  +- Join Inner
   :     :- LogicalRDD [a#9L], false
   :     +- LogicalRDD [b#2L], false
   +- LogicalRDD [b#12L], false

> False positive when cheking for ambigious columns
> --------------------------------------------------
>
>                 Key: SPARK-45722
>                 URL: https://issues.apache.org/jira/browse/SPARK-45722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0
>         Environment: py3.11
> pyspark 3.4.0
>            Reporter: Alexey Dmitriev
>            Priority: Major
>
> I have the following code, which I expect to work:
> {code:java}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
>
> session = SparkSession.Builder().getOrCreate()
> A = session.createDataFrame([(1,)], ['a'])
> B = session.createDataFrame([(1,)], ['b'])
> A.join(B).select(B.b)  # works fine
> C = A.join(A.join(B), on=F.lit(False), how='leftanti')  # C has the same columns as A (columns, not only names)
> C.join(B).select(B.b)  # doesn't work, says B.b is ambiguous
> {code}
> Exception below:
> {code:java}
> AnalysisException: Column b#11L are ambiguous. It's probably because you
> joined several Datasets together, and some of these Datasets are the same.
> This column points to one of the Datasets but Spark is unable to figure out
> which one. Please alias the Datasets with different names via `Dataset.as`
> before joining them, and specify the column using qualified name, e.g.
> `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
> {code}
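
A possible reading of the plan in the comment above, sketched as a toy model in plain Python. This is not Spark's analyzer: the `Leaf` and `Join` classes and the `output` function are hypothetical names made up for illustration. The only assumption about Spark is that a left anti join exposes the columns of its left side only. Under that assumption the top-level join exposes a#0L and b#12L, while b#2L (the attribute that B.b refers to) sits below the left anti join and is not visible to the Project.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """Stands in for a LogicalRDD node; attrs are the attribute IDs it produces."""
    attrs: Tuple[str, ...]


@dataclass(frozen=True)
class Join:
    """Stands in for a Join node of the logical plan."""
    left: "Plan"
    right: "Plan"
    how: str = "inner"


Plan = Union[Leaf, Join]


def output(plan: Plan) -> List[str]:
    """Return the attribute IDs visible above this node (toy semantics)."""
    if isinstance(plan, Leaf):
        return list(plan.attrs)
    if plan.how in ("leftanti", "leftsemi"):
        # Anti and semi joins only filter the left side; right-side
        # columns are not part of the result.
        return output(plan.left)
    return output(plan.left) + output(plan.right)


# The plan from the AnalysisException, using the attribute IDs it prints.
a_join_b = Join(Leaf(("a#9L",)), Leaf(("b#2L",)))
c = Join(Leaf(("a#0L",)), a_join_b, how="leftanti")
top = Join(c, Leaf(("b#12L",)))

visible = output(top)
requested = ["b#2L"]  # what B.b refers to in the failing select
missing = [attr for attr in requested if attr not in visible]

print("visible:", visible)
print("missing:", missing)
```

In this model `visible` matches the "a#0L,b#12L" part of the error message and `missing` contains b#2L, which is consistent with the failure still occurring when spark.sql.analyzer.failAmbiguousSelfJoin is turned off.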