[ 
https://issues.apache.org/jira/browse/SPARK-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209503#comment-15209503
 ] 

Sunitha Kambhampati edited comment on SPARK-14040 at 3/24/16 12:48 AM:
-----------------------------------------------------------------------

Here are my notes on the investigation. 
{noformat}
A smaller repro: 

  test("test nullsafe3 - Wrong results 14040") {
    val b1 = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
    val a1 = b1.where("c = 1")
    a1.printSchema()
    a1.join(b1, a1("c") <=> b1("c"), "left_outer").explain(true)
    a1.join(b1, a1("c") <=> b1("c"), "left_outer").show()
  }
{noformat}  

* The Parsed Logical Plan resolves both columns in the join condition to the 
*same* attribute, which causes the issue.

* Note that if you use === in the join condition, the results are correct, 
because join in Dataset has special-casing logic that reanalyzes the condition 
and thus avoids the problem.

* One way to fix this is to add the same special case for EqualNullSafe to join 
in Dataset.  I have added it, and the test case passes. 
* That said, there is a more fundamental problem: column resolution is 
incorrect when the column names are the same.  
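
To make the special-casing idea in the notes above concrete, here is a toy model in plain Scala. It is only a sketch: Attr, EqualTo, EqualNullSafe and SelfJoinToy below are hypothetical stand-ins defined for this example, not Spark's Catalyst classes, and the assumption that the right side of the join carries a fresh attribute id is part of the model, not something taken from Spark's source.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
sealed trait Expr
final case class Attr(name: String, id: Int) extends Expr
final case class EqualTo(left: Expr, right: Expr) extends Expr
final case class EqualNullSafe(left: Expr, right: Expr) extends Expr

object SelfJoinToy {
  // Look a column up by name in the output of one side of the join.
  def resolve(output: Seq[Attr], name: String): Attr =
    output.find(_.name == name).getOrElse(sys.error(s"cannot resolve $name"))

  // If a comparison refers to the same attribute id on both sides,
  // resolve the name again, once against each side of the join.
  def disambiguate(cond: Expr, left: Seq[Attr], right: Seq[Attr]): Expr =
    cond match {
      case EqualTo(a: Attr, b: Attr) if a.id == b.id =>
        EqualTo(resolve(left, a.name), resolve(right, b.name))
      // The extra case: treat <=> the same way as ===.
      case EqualNullSafe(a: Attr, b: Attr) if a.id == b.id =>
        EqualNullSafe(resolve(left, a.name), resolve(right, b.name))
      case other => other
    }
}
```

With a left output of Seq(Attr("c", 1)) and a right output of Seq(Attr("c", 2)), disambiguate turns EqualNullSafe(Attr("c", 1), Attr("c", 1)) into a comparison between the two distinct attributes. Without the EqualNullSafe case, the condition would pass through unchanged and keep comparing the column with itself.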



> Null-safe and equality join produces incorrect result with filtered dataframe
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-14040
>                 URL: https://issues.apache.org/jira/browse/SPARK-14040
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: Ubuntu Linux 15.10
>            Reporter: Denton Cockburn
>
> Initial issue reported here: 
> http://stackoverflow.com/questions/36131942/spark-join-produces-wrong-results
>       val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
>       val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
>       a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
> Produces 2 rows instead of the expected 1.
>       a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
> Also produces 2 rows instead of the expected 1.
> The only one that seemed to work correctly was:
>       a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
> But that produced a warning:
>       WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232' 
> As pointed out by commenter zero323:
> "The second behavior looks indeed like a bug related to the fact that you 
> still have a.c in your data. It looks like it is picked downstream before b.c 
> and the evaluated condition is actually a.newc = a.c"
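
As a possible workaround for the queries quoted above (it is not proposed anywhere in the issue), a self-join can give each side its own alias and refer to columns by qualified name, so that neither side of the condition depends on a("c") or b("c"). The sketch below assumes Spark 2.x or later for SparkSession; on 1.6 the equivalent entry point would be SQLContext. The alias names "l" and "r" and the object name are arbitrary choices made for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AliasedSelfJoin {
  // Builds the join from the issue description, with aliased sides.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
    val a = b.where("c = 1")
      .withColumnRenamed("a", "filta")
      .withColumnRenamed("b", "filtb")

    // Alias each side so that "l.c" and "r.c" are resolved separately.
    val l = a.as("l")
    val r = b.as("r")

    l.join(
      r,
      $"filta" <=> $"r.a" && $"filtb" <=> $"r.b" && $"l.c" <=> $"r.c",
      "left_outer")
  }

  def main(args: Array[String]): Unit = {
    // Local session, assumed here only to keep the example self-contained.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-14040-alias-sketch")
      .getOrCreate()
    try build(spark).show()
    finally spark.stop()
  }
}
```

The intent is that the qualified names keep the c column of the filtered side distinct from the c column of the unfiltered side. This sidesteps the ambiguity and does not address the underlying resolution problem.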



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
