Github user viirya commented on the issue:
https://github.com/apache/spark/pull/14452
Q47 is a typical query that can benefit from this PR:
with v1 as(
select i_category, i_brand,
s_store_name, s_company_name,
d_year, d_moy,
sum(ss_sales_price) sum_sales,
avg(sum(ss_sales_price)) over
(partition by i_category, i_brand,
s_store_name, s_company_name, d_year)
avg_monthly_sales,
rank() over
(partition by i_category, i_brand,
s_store_name, s_company_name
order by d_year, d_moy) rn
from item, store_sales, date_dim, store
where ss_item_sk = i_item_sk and
ss_sold_date_sk = d_date_sk and
ss_store_sk = s_store_sk and
(
d_year = 1999 or
( d_year = 1999-1 and d_moy =12) or
( d_year = 1999+1 and d_moy =1)
group by i_category, i_brand,
s_store_name, s_company_name,
d_year, d_moy),
v2 as(
select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name,
v1.d_year,
v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, v1_lag.sum_sales
psum,
v1_lead.sum_sales nsum
from v1, v1 v1_lag, v1 v1_lead
where v1.i_category = v1_lag.i_category and
v1.i_category = v1_lead.i_category and
v1.i_brand = v1_lag.i_brand and
v1.i_brand = v1_lead.i_brand and
v1.s_store_name = v1_lag.s_store_name and
v1.s_store_name = v1_lead.s_store_name and
v1.s_company_name = v1_lag.s_company_name and
v1.s_company_name = v1_lead.s_company_name and
v1.rn = v1_lag.rn + 1 and
v1.rn = v1_lead.rn - 1)
select * from v2
where d_year = 1999 and
avg_monthly_sales > 0 and
case when avg_monthly_sales > 0 then abs(sum_sales -
avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100
It has 2 CTEs, `v1` and `v2`. `v2` joins three duplicated `v1` plans. There
are no disjunctive predicates needed to be pushed down to those `v1` plans. So
the three duplicated `v1` plans have the same output data in the end. The
physical plan of `v1` looks complicated.
Obviously without this PR we will run three physical plans of `v1`. This PR
runs the physical plan of `v1` for only one time and cache its output data.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]