[ https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14373008#comment-14373008 ]

Reynold Xin edited comment on SPARK-6231 at 3/22/15 1:17 AM:
-------------------------------------------------------------

Just wanted to give some more info. The query below was working fine up to commit
9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd; it broke in the 1.3.0 release.

{code}
select Orders.Country, Orders.ProductCategory, count(1)
from Orders
join (select Orders.Country, count(1) CountryOrderCount from Orders
      where to_date(Orders.PlacedDate) > '2015-01-01'
      group by Orders.Country order by CountryOrderCount DESC LIMIT 5) Top5Countries
  on Top5Countries.Country = Orders.Country
where to_date(Orders.PlacedDate) > '2015-01-01'
group by Orders.Country, Orders.ProductCategory;
{code}

The temporary workaround is to add an explicit alias for the table Orders:

{code}
select o.Country, o.ProductCategory, count(1)
from Orders o
join (select r.Country, count(1) CountryOrderCount from Orders r
      where to_date(r.PlacedDate) > '2015-01-01'
      group by r.Country order by CountryOrderCount DESC LIMIT 5) Top5Countries
  on Top5Countries.Country = o.Country
where to_date(o.PlacedDate) > '2015-01-01'
group by o.Country, o.ProductCategory;
{code}

However, this change does not only affect self joins; it also seems to affect union
queries. The query below, which was likewise working before commit 9a151ce, is now
broken:

{code}
select Orders.Country, null, count(1) OrderCount from Orders
group by Orders.Country, null
union all
select null, Orders.ProductCategory, count(1) OrderCount from Orders
group by null, Orders.ProductCategory
{code}
It fails with an AnalysisException.
The workaround is to add different aliases for the tables, as sketched below.
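For example, a minimal sketch of the aliased form (a and b are just illustrative alias
names, not part of the original queries):

{code}
select a.Country, null, count(1) OrderCount from Orders a
group by a.Country, null
union all
select null, b.ProductCategory, count(1) OrderCount from Orders b
group by null, b.ProductCategory
{code}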


> Join on two tables (generated from same one) is broken
> ------------------------------------------------------
>
>                 Key: SPARK-6231
>                 URL: https://issues.apache.org/jira/browse/SPARK-6231
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0, 1.4.0
>            Reporter: Davies Liu
>            Assignee: Michael Armbrust
>            Priority: Blocker
>              Labels: DataFrame
>
> If the two columns used in the joinExpr come from the same table, they have the 
> same id, so the joinExpr is resolved in the wrong way.
> {code}
> val df = sqlContext.load(path, "parquet")
> val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
> val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
> val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
> scala> rmJoin.explain
> == Physical Plan ==
> CartesianProduct
>  Filter (cust_id#0 = cust_id#0)
>   Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
>    Exchange (HashPartitioning [cust_id#0], 200)
>     Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
>      PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
>  Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
>   Exchange (HashPartitioning [cust_id#17], 200)
>    Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
>     PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
> {code}
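For the DataFrame example quoted above, a possible workaround sketch in the same spirit
as the SQL aliasing is to give one side a differently named join column before joining.
This is not from the report; it assumes the same imports as the snippet above, and
txn_cust_id is just a hypothetical name:

{code}
// Sketch only: rename cust_id on one side so the two sides of the join
// condition no longer resolve to the same attribute.
val txns = df.groupBy("cust_id")
  .agg($"cust_id", countDistinct($"day_num").as("txns"))
  .withColumnRenamed("cust_id", "txn_cust_id")
val spend = df.groupBy("cust_id")
  .agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("txn_cust_id") === spend("cust_id"), "inner")
{code}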


