[
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037023#comment-15037023
]
Min Qiu commented on SPARK-12114:
---------------------------------
The [pull request|https://github.com/apache/spark/pull/10087] is related to
this issue, but the pull request contains changes to the other problem. I will
work out a new pull request for this issue only.
> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---------------------------------------------------------------
>
> Key: SPARK-12114
> URL: https://issues.apache.org/jira/browse/SPARK-12114
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Min Qiu
>
> For the query
> {quote}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate,
> o_totalprice, sum(l_quantity)
> FROM customer join orders join lineitem
> on c_custkey = o_custkey AND o_orderkey = l_orderkey
> left outer join (SELECT l_orderkey tmp_orderkey
> FROM lineitem
> GROUP BY l_orderkey
> HAVING sum(l_quantity) > 300) tmp
> on o_orderkey = tmp_orderkey
> WHERE tmp_orderkey IS NOT NULL
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
> ORDER BY o_totalprice DESC, o_orderdate
> {quote}
> The optimizedPlan is
> {quote}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>
> Aggregate
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48],
> \[c_name#38,c_custkey#37,o_orderkey#45,
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
> {color: green}Project
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
> Filter IS NOT NULL tmp_orderkey#35
> Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
> Join Inner, Some((c_custkey#37 = o_custkey#46))
> MetastoreRelation default, customer, None
> Join Inner, Some((o_orderkey#45 = l_orderkey#54))
> MetastoreRelation default, orders, None
> MetastoreRelation default, lineitem, None
> Project \[tmp_orderkey#35]
> Filter havingCondition#86
> Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
> Project \[l_orderkey#70,l_quantity#74]
> MetastoreRelation default, lineitem, None
> {quote}
> Due to the pattern highlighted in green that the ColumnPruning rule fails to
> deal with, all columns of lineitem and orders tables are scanned. The
> unneeded columns are also involved in the data Shuffling. The performance is
> extremely bad if any one of the two tables is big.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]