[ https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209503#comment-15209503 ]
Sunitha Kambhampati edited comment on SPARK-14040 at 3/24/16 12:48 AM:
-----------------------------------------------------------------------

Here are my notes on the investigation.

{noformat}
A smaller repro:

test("test nullsafe3 - Wrong results 14040") {
  val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
  val a1 = b1.where("c = 1")
  a1.printSchema()
  a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
  a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
}
{noformat}

* The parsed logical plan resolves both sides of the join condition to the *same* column, which causes the issue.
* Note that if you use === in the join condition, the results are correct, because join in Dataset has special-case logic that re-analyzes the plan and so avoids the problem.
* One way to fix this is to add the same special case for EqualNullSafe to join in Dataset. I have added it, and the test case passes.
* That said, there is a more fundamental problem: column resolution is incorrect when the column names are the same.

was (Author: ksunitha):
Here are my notes on the investigation.

{noformat}
A smaller repro:

test("test nullsafe3 - Wrong results 14040") {
  val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
  val a1 = b1.where("c = 1")
  a1.printSchema()
  a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
  a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
}

The ParsedLogicalPlan resolves the column in the join condition to the same causing the issue.

Note, if you use the === , the results are correct because there is special casing logic in join in the DataSet to reanalyze and thus it avoids the problem.

One way to fix this is to add the special case for the EqualNullSafe in join in DataSet. I have added it and the testcase works fine.

But that said, there is a general fundamental problem that the resolution of the column is incorrect when the column names are same.
{noformat}

> Null-safe and equality join produces incorrect result with filtered dataframe
> -----------------------------------------------------------------------------
>
> Key: SPARK-14040
> URL: https://issues.apache.org/jira/browse/SPARK-14040
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Environment: Ubuntu Linux 15.10
> Reporter: Denton Cockburn
>
> Initial issue reported here:
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>
> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
>
> Produces 2 rows instead of the expected 1.
>
> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
>
> Also produces 2 rows instead of the expected 1.
>
> The only one that seemed to work correctly was:
>
> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
>
> But that produced a warning for:
>
> WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'
>
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you
> still have a.c in your data. It looks like it is picked downstream before b.c
> and the evaluated condition is actually a.newc = a.c"
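
A possible interim workaround, sketched under assumptions: since the notes above trace the wrong result to both sides of the condition resolving to the same column, one commonly used way to keep same-named columns apart in a join of a DataFrame with a filtered copy of itself is to alias each side and refer to the columns through the aliases. The object name AliasedJoinSketch, the method aliasedJoin, the aliases "l" and "r", and the use of SparkSession (Spark 2.0 or later; on 1.6 a SQLContext would be needed instead) are illustrative choices, not something taken from this issue. Whether this avoids the bug on 1.6.0 has not been verified in this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark or of the proposed fix.
object AliasedJoinSketch {

  // Rebuilds the join from the smaller repro, but names each side explicitly
  // so the condition refers to "l.c" and "r.c" rather than a1("c") and b1("c").
  def aliasedJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
    val a1 = b1.where("c = 1")

    // Assumption: aliases are enough for the analyzer to tell the two sides apart.
    val left = a1.as("l")
    val right = b1.as("r")

    left.join(right, col("l.c") <=> col("r.c"), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-14040-alias-sketch")
      .getOrCreate()
    try {
      val joined = aliasedJoin(spark)
      joined.explain(true) // inspect which attributes the condition resolved to
      joined.show()
    } finally {
      spark.stop()
    }
  }
}
```

Comparing the explain(true) output of this version with that of the original repro should show whether the two sides of the null-safe comparison now carry different attribute IDs.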