Github user viirya commented on the issue:
https://github.com/apache/spark/pull/14452
There is an example which current shuffle exchange reuse can't work, TPC-DS
query 47. The query looks like:
with v1 as(
select i_category, i_brand,
s_store_name, s_company_name,
d_year, d_moy,
sum(ss_sales_price) sum_sales,
avg(sum(ss_sales_price)) over
(partition by i_category, i_brand,
s_store_name, s_company_name, d_year)
avg_monthly_sales,
rank() over
(partition by i_category, i_brand,
s_store_name, s_company_name
order by d_year, d_moy) rn
from item, store_sales, date_dim, store
where ss_item_sk = i_item_sk and
ss_sold_date_sk = d_date_sk and
ss_store_sk = s_store_sk and
(
d_year = 1999 or
( d_year = 1999-1 and d_moy =12) or
( d_year = 1999+1 and d_moy =1)
)
group by i_category, i_brand,
s_store_name, s_company_name,
d_year, d_moy),
v2 as(
select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name,
v1.d_year,
v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales,
v1_lag.sum_sales psum,
v1_lead.sum_sales nsum
from v1, v1 v1_lag, v1 v1_lead
where v1.i_category = v1_lag.i_category and
v1.i_category = v1_lead.i_category and
v1.i_brand = v1_lag.i_brand and
v1.i_brand = v1_lead.i_brand and
v1.s_store_name = v1_lag.s_store_name and
v1.s_store_name = v1_lead.s_store_name and
v1.s_company_name = v1_lag.s_company_name and
v1.s_company_name = v1_lead.s_company_name and
v1.rn = v1_lag.rn + 1 and
v1.rn = v1_lead.rn - 1)
select * from v2
where d_year = 1999 and
avg_monthly_sales > 0 and
case when avg_monthly_sales > 0 then abs(sum_sales -
avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100
In the physical plan, the subquery v1 is the common subquery which is used
three times (i.e., v1, v1_lag and v1_lead) in the subquery v2. The three v1
subqueries are the child of three shuffle exchange nodes. Because they have
different columns (rn, rn + 1, rn - 1) in hash partitioning, the three shuffle
exchange nodes are not the same and can't be reused. However, with this patch,
we can reuse the common subquery. By doing that, we observe significant
improvement in q47.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]