Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14452
  
    There is an example which current shuffle exchange reuse can't work, TPC-DS 
query 47. The query looks like:
    
        with v1 as(
        select i_category, i_brand,
               s_store_name, s_company_name,
               d_year, d_moy,
               sum(ss_sales_price) sum_sales,
               avg(sum(ss_sales_price)) over
                 (partition by i_category, i_brand,
                            s_store_name, s_company_name, d_year)
                 avg_monthly_sales,
               rank() over
                 (partition by i_category, i_brand,
                            s_store_name, s_company_name
                  order by d_year, d_moy) rn
        from item, store_sales, date_dim, store
        where ss_item_sk = i_item_sk and
              ss_sold_date_sk = d_date_sk and
              ss_store_sk = s_store_sk and
              (
                d_year = 1999 or
                ( d_year = 1999-1 and d_moy =12) or
                ( d_year = 1999+1 and d_moy =1)
              )
        group by i_category, i_brand,
                 s_store_name, s_company_name,
                 d_year, d_moy),
        v2 as(
        select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, 
v1.d_year,
                            v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, 
v1_lag.sum_sales psum,
                            v1_lead.sum_sales nsum
        from v1, v1 v1_lag, v1 v1_lead
        where v1.i_category = v1_lag.i_category and
              v1.i_category = v1_lead.i_category and
              v1.i_brand = v1_lag.i_brand and
              v1.i_brand = v1_lead.i_brand and
              v1.s_store_name = v1_lag.s_store_name and
              v1.s_store_name = v1_lead.s_store_name and
              v1.s_company_name = v1_lag.s_company_name and
              v1.s_company_name = v1_lead.s_company_name and
              v1.rn = v1_lag.rn + 1 and
              v1.rn = v1_lead.rn - 1)
        select * from v2
        where  d_year = 1999 and
               avg_monthly_sales > 0 and
               case when avg_monthly_sales > 0 then abs(sum_sales - 
avg_monthly_sales) / avg_monthly_sales else null end > 0.1
        order by sum_sales - avg_monthly_sales, 3
        limit 100
    
    In the physical plan, the subquery v1 is the common subquery which is used 
three times (i.e., v1, v1_lag and v1_lead) in the subquery v2. The three v1 
subqueries are the child of three shuffle exchange nodes. Because they have 
different columns (rn, rn + 1, rn - 1) in hash partitioning, the three shuffle 
exchange nodes are not the same and can't be reused. However, with this patch, 
we can reuse the common subquery. By doing that, we observe significant 
improvement in q47.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to