[jira] [Commented] (SPARK-6231) Join on two tables (generated from same one) is broken

Michael Armbrust (JIRA) Sun, 22 Mar 2015 12:46:21 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375130#comment-14375130
 ]


Michael Armbrust commented on SPARK-6231:
-----------------------------------------

After a bunch of investigation I'm going to bump this to 1.4 as fixing it is 
going to require significant changes to analysis.  The query is actually 
running correctly.  The problem is that {{txns("cust_id") === 
spend("cust_id")}} is always {{true}} as there is no way for the optimizer to 
differentiate these columns since they are actually identical.  We would need 
to change the DataFrame API to construct new expression IDs for each operation 
to handle this case correctly.  For now, use the workaround specified above 
with aliases.

Additionally, [SPARK-6459] will give a warning when users hit this case

> Join on two tables (generated from same one) is broken
> ------------------------------------------------------
>
>                 Key: SPARK-6231
>                 URL: https://issues.apache.org/jira/browse/SPARK-6231
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Davies Liu
>            Assignee: Michael Armbrust
>            Priority: Blocker
>              Labels: DataFrame
>
> If the two column used in joinExpr come from the same table, they have the 
> same id, then the joniExpr is explained in wrong way.
> {code}
> val df = sqlContext.load(path, "parquet")
> val txns = df.groupBy("cust_id").agg($"cust_id", 
> countDistinct($"day_num").as("txns"))
> val spend = df.groupBy("cust_id").agg($"cust_id", 
> sum($"extended_price").as("spend"))
> val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
> scala> rmJoin.explain
> == Physical Plan ==
> CartesianProduct
>  Filter (cust_id#0 = cust_id#0)
>   Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS 
> txns#7L]
>    Exchange (HashPartitioning [cust_id#0], 200)
>     Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS 
> partialSets#25]
>      PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at 
> newParquet.scala:542
>  Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
>   Exchange (HashPartitioning [cust_id#17], 200)
>    Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS 
> PartialSum#38]
>     PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at 
> newParquet.scala:542
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-6231) Join on two tables (generated from same one) is broken

Reply via email to