[
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin updated SPARK-12114:
--------------------------------
Description:
For the query
{code}
SELECT c_name, c_custkey, o_orderkey, o_orderdate,
o_totalprice, sum(l_quantity)
FROM customer join orders join lineitem
on c_custkey = o_custkey AND o_orderkey = l_orderkey
left outer join (SELECT l_orderkey tmp_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING sum(l_quantity) > 300) tmp
on o_orderkey = tmp_orderkey
WHERE tmp_orderkey IS NOT NULL
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice DESC, o_orderdate
{code}
The optimizedPlan is
{code}
Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
Aggregate
\[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48],
\[c_name#38,c_custkey#37,o_orderkey#45,
o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
{color: green}Project
\[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
Filter IS NOT NULL tmp_orderkey#35
Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
Join Inner, Some((c_custkey#37 = o_custkey#46))
MetastoreRelation default, customer, None
Join Inner, Some((o_orderkey#45 = l_orderkey#54))
MetastoreRelation default, orders, None
MetastoreRelation default, lineitem, None
Project \[tmp_orderkey#35]
Filter havingCondition#86
Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS
havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
Project \[l_orderkey#70,l_quantity#74]
MetastoreRelation default, lineitem, None
{code}
Due to the pattern highlighted in green that the ColumnPruning rule fails to
deal with, all columns of lineitem and orders tables are scanned. The unneeded
columns are also involved in the data Shuffling. The performance is extremely
bad if any one of the two tables is big.
was:
For the query
{quote}
SELECT c_name, c_custkey, o_orderkey, o_orderdate,
o_totalprice, sum(l_quantity)
FROM customer join orders join lineitem
on c_custkey = o_custkey AND o_orderkey = l_orderkey
left outer join (SELECT l_orderkey tmp_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING sum(l_quantity) > 300) tmp
on o_orderkey = tmp_orderkey
WHERE tmp_orderkey IS NOT NULL
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice DESC, o_orderdate
{quote}
The optimizedPlan is
{quote}
Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
Aggregate
\[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48],
\[c_name#38,c_custkey#37,o_orderkey#45,
o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
{color: green}Project
\[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
Filter IS NOT NULL tmp_orderkey#35
Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
Join Inner, Some((c_custkey#37 = o_custkey#46))
MetastoreRelation default, customer, None
Join Inner, Some((o_orderkey#45 = l_orderkey#54))
MetastoreRelation default, orders, None
MetastoreRelation default, lineitem, None
Project \[tmp_orderkey#35]
Filter havingCondition#86
Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS
havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
Project \[l_orderkey#70,l_quantity#74]
MetastoreRelation default, lineitem, None
{quote}
Due to the pattern highlighted in green that the ColumnPruning rule fails to
deal with, all columns of lineitem and orders tables are scanned. The unneeded
columns are also involved in the data Shuffling. The performance is extremely
bad if any one of the two tables is big.
> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---------------------------------------------------------------
>
> Key: SPARK-12114
> URL: https://issues.apache.org/jira/browse/SPARK-12114
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Min Qiu
>
> For the query
> {code}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate,
> o_totalprice, sum(l_quantity)
> FROM customer join orders join lineitem
> on c_custkey = o_custkey AND o_orderkey = l_orderkey
> left outer join (SELECT l_orderkey tmp_orderkey
> FROM lineitem
> GROUP BY l_orderkey
> HAVING sum(l_quantity) > 300) tmp
> on o_orderkey = tmp_orderkey
> WHERE tmp_orderkey IS NOT NULL
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
> ORDER BY o_totalprice DESC, o_orderdate
> {code}
> The optimizedPlan is
> {code}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>
> Aggregate
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48],
> \[c_name#38,c_custkey#37,o_orderkey#45,
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
> {color: green}Project
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
> Filter IS NOT NULL tmp_orderkey#35
> Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
> Join Inner, Some((c_custkey#37 = o_custkey#46))
> MetastoreRelation default, customer, None
> Join Inner, Some((o_orderkey#45 = l_orderkey#54))
> MetastoreRelation default, orders, None
> MetastoreRelation default, lineitem, None
> Project \[tmp_orderkey#35]
> Filter havingCondition#86
> Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
> Project \[l_orderkey#70,l_quantity#74]
> MetastoreRelation default, lineitem, None
> {code}
> Due to the pattern highlighted in green that the ColumnPruning rule fails to
> deal with, all columns of lineitem and orders tables are scanned. The
> unneeded columns are also involved in the data Shuffling. The performance is
> extremely bad if any one of the two tables is big.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]