[jira] [Created] (SPARK-14032) Eliminate Unnecessary Distinct/Aggregate

Xiao Li (JIRA) Sun, 20 Mar 2016 15:14:56 -0700

Xiao Li created SPARK-14032:
-------------------------------

             Summary: Eliminate Unnecessary Distinct/Aggregate
                 Key: SPARK-14032
                 URL: https://issues.apache.org/jira/browse/SPARK-14032
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li



Distinct is an expensive operation, so we should avoid it where possible. When 
the child operator already guarantees that its output is distinct, the Distinct 
above it is redundant and can be removed.

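For example, in the following TPC-DS query 38, every input of Intersect is 
already distinct. After Intersect is converted to Left-semi join + Distinct, 
the left child of the join is distinct, and a left-semi join emits each left 
row at most once, so its output is distinct as well. The top Distinct can 
therefore be removed.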

{code}
select count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
    where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
    where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
) hot_cyst
{code}
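
The rewrite described above can be sketched on a toy logical plan. This is only 
an illustration under stated assumptions, not Spark's Catalyst code: the node 
classes (Scan, Distinct, LeftSemiJoin, Intersect) and the helper functions 
is_distinct, rewrite_intersect and eliminate_distinct are hypothetical names 
made up for this example, and the joins and filters inside each branch of the 
query are collapsed into a single Scan.

```python
from dataclasses import dataclass


class Plan:
    """Base class for toy logical plan nodes (hypothetical, not Spark's API)."""


@dataclass(frozen=True)
class Scan(Plan):
    table: str


@dataclass(frozen=True)
class Distinct(Plan):
    child: Plan


@dataclass(frozen=True)
class LeftSemiJoin(Plan):
    left: Plan
    right: Plan


@dataclass(frozen=True)
class Intersect(Plan):
    left: Plan
    right: Plan


def is_distinct(plan: Plan) -> bool:
    """Return True if the node's output is guaranteed to be duplicate-free."""
    if isinstance(plan, (Distinct, Intersect)):
        return True
    if isinstance(plan, LeftSemiJoin):
        # A left-semi join emits each left row at most once,
        # so it preserves the distinctness of its left child.
        return is_distinct(plan.left)
    return False  # nothing is known about a plain scan


def rewrite_intersect(plan: Plan) -> Plan:
    """Rewrite Intersect(l, r) as Distinct(LeftSemiJoin(l, r)), bottom-up."""
    if isinstance(plan, Intersect):
        return Distinct(LeftSemiJoin(rewrite_intersect(plan.left),
                                     rewrite_intersect(plan.right)))
    if isinstance(plan, Distinct):
        return Distinct(rewrite_intersect(plan.child))
    if isinstance(plan, LeftSemiJoin):
        return LeftSemiJoin(rewrite_intersect(plan.left),
                            rewrite_intersect(plan.right))
    return plan


def eliminate_distinct(plan: Plan) -> Plan:
    """Drop every Distinct whose (already optimized) child is distinct."""
    if isinstance(plan, Distinct):
        child = eliminate_distinct(plan.child)
        return child if is_distinct(child) else Distinct(child)
    if isinstance(plan, LeftSemiJoin):
        return LeftSemiJoin(eliminate_distinct(plan.left),
                            eliminate_distinct(plan.right))
    return plan


# Shape of query 38: three "select distinct" branches chained by intersect.
store, catalog, web = (Distinct(Scan(t))
                       for t in ("store_sales", "catalog_sales", "web_sales"))
plan = Intersect(Intersect(store, catalog), web)
optimized = eliminate_distinct(rewrite_intersect(plan))
print(optimized)
```

In this sketch only the Distinct nodes written in the query text survive; the 
ones introduced by the Intersect conversion are dropped because the left side 
of each semi join is already distinct. A Distinct directly over a Scan is kept, 
since nothing guarantees that a scan is duplicate-free.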



