[ 
https://issues.apache.org/jira/browse/SPARK-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074322#comment-15074322
 ] 

Maciej Szymkiewicz commented on SPARK-6459:
-------------------------------------------

I've been trying to reproduce the problem on 1.5.2 to illustrate why there is a
need for aliases, but surprisingly it worked just fine.

{code}
val df = sc.parallelize(Seq(
  ("a", 1, 0.2), ("a", 2, 0.3), ("b", 2, 0.4), ("b", 3, 0.5)
)).toDF("x", "y", "z")
val as = df.where($"x" === "a")
val bs = df.where($"x" === "b")
as.join(bs, as("y") === bs("y")).collect
{code}

I get a warning but no Cartesian product. 

{code}
scala> as.join(bs, as("y") === bs("y")).explain(true)
15/12/29 21:29:16 WARN Column: Constructing trivially true equals predicate, 'y#4 = y#4'. Perhaps you need to use aliases.
== Parsed Logical Plan ==
Join Inner, Some((y#4 = y#17))
 Filter (x#3 = a)
  Project [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
   LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
 Filter (x#16 = b)
  Project [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
   LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21

== Analyzed Logical Plan ==
x: string, y: int, z: double, x: string, y: int, z: double
Join Inner, Some((y#4 = y#17))
 Filter (x#3 = a)
  Project [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
   LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
 Filter (x#16 = b)
  Project [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
   LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21

== Optimized Logical Plan ==
Join Inner, Some((y#4 = y#17))
 Project [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
  Filter (_1#0 = a)
   LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
 Project [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
  Filter (_1#0 = b)
   LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21

== Physical Plan ==
SortMergeJoin [y#4], [y#17]
 TungstenSort [y#4 ASC], false, 0
  TungstenExchange hashpartitioning(y#4)
   TungstenProject [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
    Filter (_1#0 = a)
     Scan PhysicalRDD[_1#0,_2#1,_3#2]
 TungstenSort [y#17 ASC], false, 0
  TungstenExchange hashpartitioning(y#17)
   TungstenProject [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
    Filter (_1#0 = b)
     Scan PhysicalRDD[_1#0,_2#1,_3#2]

Code Generation: true
{code}
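
For comparison, the aliased form that the warning hints at could look roughly like the sketch below. It assumes the same spark-shell session and the same {{df}} as above; the alias names "a" and "b" are arbitrary choices of mine, and I have not checked the resulting plan on every version.

{code}
// Give each side of the self join its own alias so that the join
// condition refers to columns by qualifier rather than by the
// attribute references of the parent DataFrame.
val as = df.where($"x" === "a").as("a")
val bs = df.where($"x" === "b").as("b")

// $"a.y" and $"b.y" are resolved against the aliases at analysis time,
// so the predicate should not be built as a trivially true 'y = y'.
as.join(bs, $"a.y" === $"b.y").collect
{code}

Written this way the condition does not depend on how {{as("y")}} and {{bs("y")}} happen to be resolved, which is the ambiguity the warning is about.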

> Warn when Column API is constructing trivially true equality
> ------------------------------------------------------------
>
>                 Key: SPARK-6459
>                 URL: https://issues.apache.org/jira/browse/SPARK-6459
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>            Priority: Critical
>             Fix For: 1.3.1, 1.4.0
>
>
> Right now it's pretty confusing when a user constructs an equality predicate
> that is going to be used in a self join, where the optimizer cannot
> distinguish between the attributes in question (e.g., [SPARK-6231]). Since
> there is really no good reason to do this, let's print a warning.


