[
https://issues.apache.org/jira/browse/SPARK-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074322#comment-15074322
]
Maciej Szymkiewicz commented on SPARK-6459:
-------------------------------------------
I've been trying to reproduce the problem on 1.5.2 to illustrate why aliases
are needed, but surprisingly it worked just fine.
{code}
val df = sc.parallelize(Seq(
  ("a", 1, 0.2), ("a", 2, 0.3), ("b", 2, 0.4), ("b", 3, 0.5)
)).toDF("x", "y", "z")
val as = df.where($"x" === "a")
val bs = df.where($"x" === "b")
as.join(bs, as("y") === bs("y")).collect
{code}
I get a warning but no Cartesian product.
{code}
scala> as.join(bs, as("y") === bs("y")).explain(true)
15/12/29 21:29:16 WARN Column: Constructing trivially true equals predicate, 'y#4 = y#4'. Perhaps you need to use aliases.
== Parsed Logical Plan ==
Join Inner, Some((y#4 = y#17))
Filter (x#3 = a)
Project [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
Filter (x#16 = b)
Project [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
== Analyzed Logical Plan ==
x: string, y: int, z: double, x: string, y: int, z: double
Join Inner, Some((y#4 = y#17))
Filter (x#3 = a)
Project [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
Filter (x#16 = b)
Project [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
== Optimized Logical Plan ==
Join Inner, Some((y#4 = y#17))
Project [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
Filter (_1#0 = a)
LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
Project [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
Filter (_1#0 = b)
LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at <console>:21
== Physical Plan ==
SortMergeJoin [y#4], [y#17]
TungstenSort [y#4 ASC], false, 0
TungstenExchange hashpartitioning(y#4)
TungstenProject [_1#0 AS x#3,_2#1 AS y#4,_3#2 AS z#5]
Filter (_1#0 = a)
Scan PhysicalRDD[_1#0,_2#1,_3#2]
TungstenSort [y#17 ASC], false, 0
TungstenExchange hashpartitioning(y#17)
TungstenProject [_1#0 AS x#16,_2#1 AS y#17,_3#2 AS z#18]
Filter (_1#0 = b)
Scan PhysicalRDD[_1#0,_2#1,_3#2]
Code Generation: true
{code}
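For comparison, the alias-based form that the warning points to could look
roughly like the sketch below. It assumes a 1.5.x spark-shell session (so sc
and the sqlContext implicits are already in scope); the alias names "a_rows"
and "b_rows" and the val names are made up for illustration.
{code}
// Assumes spark-shell on 1.5.x: `sc`, toDF and the $"..." syntax are in scope.
val df = sc.parallelize(Seq(
  ("a", 1, 0.2), ("a", 2, 0.3), ("b", 2, 0.4), ("b", 3, 0.5)
)).toDF("x", "y", "z")

// Hypothetical alias names; each side of the self join gets its own qualifier.
val aliasedAs = df.where($"x" === "a").as("a_rows")
val aliasedBs = df.where($"x" === "b").as("b_rows")

// Qualified references state explicitly which side each "y" comes from,
// so the predicate does not rely on as("y") and bs("y") resolving differently.
val joined = aliasedAs.join(aliasedBs, $"a_rows.y" === $"b_rows.y")
joined.collect
{code}
With the sample data above, only y = 2 occurs on both sides of the join.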
> Warn when Column API is constructing trivially true equality
> ------------------------------------------------------------
>
> Key: SPARK-6459
> URL: https://issues.apache.org/jira/browse/SPARK-6459
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Michael Armbrust
> Assignee: Michael Armbrust
> Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> Right now it's pretty confusing when a user constructs an equality predicate
> that is going to be used in a self join, where the optimizer cannot
> distinguish between the attributes in question (e.g., [SPARK-6231]). Since
> there is really no good reason to do this, let's print a warning.