Xiao Li created SPARK-14032:
-------------------------------
Summary: Eliminate Unnecessary Distinct/Aggregate
Key: SPARK-14032
URL: https://issues.apache.org/jira/browse/SPARK-14032
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li
Distinct is an expensive operation, so we should avoid it where possible. When
the child operator already guarantees that its output contains no duplicate
rows, the Distinct on top of it is redundant and can be removed.
For example, in TPC-DS query 38 below, every input to Intersect is already
distinct. After Intersect is converted to Left-semi + Distinct, the child of
the top Distinct is therefore still distinct, and the top Distinct can be
removed.
{code}
select count(*) from (
  select distinct c_last_name, c_first_name, d_date
  from store_sales, date_dim, customer
  where store_sales.ss_sold_date_sk = date_dim.d_date_sk
    and store_sales.ss_customer_sk = customer.c_customer_sk
    and d_month_seq between [DMS] and [DMS] + 11
  intersect
  select distinct c_last_name, c_first_name, d_date
  from catalog_sales, date_dim, customer
  where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
    and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
    and d_month_seq between [DMS] and [DMS] + 11
  intersect
  select distinct c_last_name, c_first_name, d_date
  from web_sales, date_dim, customer
  where web_sales.ws_sold_date_sk = date_dim.d_date_sk
    and web_sales.ws_bill_customer_sk = customer.c_customer_sk
    and d_month_seq between [DMS] and [DMS] + 11
) hot_cyst
{code}
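
To make the idea concrete, here is a rough Python sketch of the rewrite on a toy logical plan. It is not Catalyst code: the node classes (Scan, Distinct, LeftSemiJoin, Intersect) and the helpers is_distinct and optimize are hypothetical names made up for illustration, and join conditions (including the null-safe equality that Intersect needs) are left out. The sketch only shows the reasoning: a left-semi join filters rows of its left side without duplicating them, so if the left side is distinct, the Distinct above the join can be dropped.

```python
from dataclasses import dataclass


# Toy logical plan nodes (hypothetical, not Spark's Catalyst classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Distinct:
    child: object


@dataclass(frozen=True)
class LeftSemiJoin:
    left: object
    right: object


@dataclass(frozen=True)
class Intersect:
    left: object
    right: object


def is_distinct(plan) -> bool:
    """Return True if the plan's output is guaranteed to be duplicate-free."""
    if isinstance(plan, Distinct):
        return True
    if isinstance(plan, LeftSemiJoin):
        # A left-semi join only filters left rows; it never duplicates them.
        return is_distinct(plan.left)
    return False  # Conservative default, e.g. for a plain Scan.


def optimize(plan):
    """Rewrite Intersect to Distinct(LeftSemiJoin), then drop redundant Distincts."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Intersect):
        return optimize(Distinct(LeftSemiJoin(plan.left, plan.right)))
    if isinstance(plan, LeftSemiJoin):
        return LeftSemiJoin(optimize(plan.left), optimize(plan.right))
    if isinstance(plan, Distinct):
        child = optimize(plan.child)
        # Keep the Distinct only when the child cannot guarantee uniqueness.
        return child if is_distinct(child) else Distinct(child)
    raise TypeError(f"unknown plan node: {plan!r}")


# Shape of query 38: three "select distinct" branches combined with intersect.
query38 = Intersect(
    Intersect(Distinct(Scan("store_sales")), Distinct(Scan("catalog_sales"))),
    Distinct(Scan("web_sales")),
)
optimized = optimize(query38)
print(optimized)
```

For the query 38 shape, both Distinct nodes introduced by the Intersect rewrite are removed, and only the three Distinct nodes written in the query remain. When the left input is a plain scan, the sketch keeps the Distinct, since nothing guarantees uniqueness there.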