[
https://issues.apache.org/jira/browse/CALCITE-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mostafa Mokhtar updated CALCITE-468:
------------------------------------
Description:
The basic idea is to apply join predicates early in a plan in order to reduce
the size of intermediate query results and, thus, reduce the cost of other
operations. In other words, the idea is to apply the same join predicates twice
or more often in a query plan
In order to reduce the communication costs of a distributed system. Obviously,
semi-join reducers are only effective if the (redundant) semi-joins are cheap
and result in a significant reduction of the size of intermediate
query results.
I propose to extend a query optimizer and integrate semi-join reducer and
join-ordering, etc. into a single query optimization step
Several TPC-DS queries like 24, 64 & 80 run very slow do to the lake of semi
join reduction optimization in Calcite.
Doing a rewrite of Q64 to simulate semi join reduction produced 4x gains.
{code}
Query Total time CPU Intermediate rows
(Million)
Baseline 1,377 356,900
23,940
Semi Join Reduction 343 47,253
23
{code}
Q64 subset
{code}
select
count(*)
FROM
store_sales
JOIN
item ON store_sales.ss_item_sk = item.i_item_sk
JOIN
store_returns ON store_sales.ss_item_sk = store_returns.sr_item_sk
JOIN
(select
cs_item_sk
from
catalog_sales
JOIN catalog_returns ON catalog_sales.cs_item_sk =
catalog_returns.cr_item_sk
group by cs_item_sk
having
sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge +
cr_store_credit)) cs_ui
ON store_sales.ss_item_sk = cs_ui.cs_item_sk
WHERE
i_color in ('maroon' , 'burnished',
'dim',
'steel',
'navajo',
'chocolate')
and i_current_price between 35 and 35 + 10
and i_current_price between 35 + 1 and 35 + 15
{code}
was:
Currently OPTIQ doesn't push down aggregates for bushy joins and this can be a
performance limitation in some queries.
Ideally before we push down the aggregate we should take into the account the
NDV and reduction of the join which we are trying to push the aggregate
underneath.
On Hive to measure potential gains I wrote a query which has the following
business question based on the TPC-DS schema:
Identify customers from specific gender with good credit rating that spent more
than X amount of money during a specific year.
This query has 3 main components
1) Hub containing store_sales & date_dim
2) Hub containing customer and customer_demographics
3) Aggregate
And for that query I have 2 rewrites :
1) Rewrite that generates a bushy join plan
2) Rewrite that generates a bushy join plan and pushes the aggregate down
Results show that aggregate push down in a bushy plan cuts down the query
runtime by half.
The aggregate push down reduces the intermediate data in the query from 384GB
to 22GB.
Base query finishes in 4,781 seconds with 2,253 GB of intermediate data
Bushy query finished in 1,374 seconds with 384GB of intermediate data
Bushy query + aggregate push down finished in 605 with only 22GB of
intermediate data.
Base query
{code}
"select
c_customer_sk, sum(ss_net_paid) as net_paid
from
customer,
customer_demographics,
store_sales,
date_dim
where
c_current_cdemo_sk = cd_demo_sk
and cd_credit_rating = 'Good'
and cd_gender = 'F'
and ss_customer_sk = c_customer_sk
and ss_customer_sk is not null
and d_year = 2001
and d_date_sk = ss_sold_date_sk
group by c_customer_sk
having net_paid > 0
order by net_paid desc
limit 100;"
{code}
Bushy query
{code}
with ss as
(
select
ss_net_paid, ss_customer_sk
from
store_sales,date_dim
where
d_year = 2001
and d_date_sk = ss_sold_date_sk
and ss_customer_sk is not null
), cc as
(
select
c_customer_sk
from
customer,customer_demographics
where
c_current_cdemo_sk = cd_demo_sk
and cd_credit_rating = 'Good'
and cd_gender = 'F'
)
select
c_customer_sk, sum(ss_net_paid) as net_paid
from
cc,ss
where
ss_customer_sk = c_customer_sk
group by c_customer_sk
having net_paid > 0
order by net_paid desc
limit 100;
{code}
Bushy query + aggregate push down
{code}
{code}
> Introduce semi join reduction optimization in Calcite
> ------------------------------------------------------
>
> Key: CALCITE-468
> URL: https://issues.apache.org/jira/browse/CALCITE-468
> Project: Calcite
> Issue Type: Bug
> Reporter: Mostafa Mokhtar
> Assignee: Laljo John Pullokkaran
> Labels: hive
>
> The basic idea is to apply join predicates early in a plan in order to reduce
> the size of intermediate query results and, thus, reduce the cost of other
> operations. In other words, the idea is to apply the same join predicates
> twice or more often in a query plan
> In order to reduce the communication costs of a distributed system.
> Obviously, semi-join reducers are only effective if the (redundant)
> semi-joins are cheap and result in a significant reduction of the size of
> intermediate
> query results.
> I propose to extend a query optimizer and integrate semi-join reducer and
> join-ordering, etc. into a single query optimization step
> Several TPC-DS queries like 24, 64 & 80 run very slow do to the lake of semi
> join reduction optimization in Calcite.
> Doing a rewrite of Q64 to simulate semi join reduction produced 4x gains.
> {code}
> Query Total time CPU Intermediate rows
> (Million)
> Baseline 1,377 356,900
> 23,940
> Semi Join Reduction 343 47,253
> 23
> {code}
> Q64 subset
> {code}
> select
> count(*)
> FROM
> store_sales
> JOIN
> item ON store_sales.ss_item_sk = item.i_item_sk
> JOIN
> store_returns ON store_sales.ss_item_sk = store_returns.sr_item_sk
> JOIN
> (select
> cs_item_sk
> from
> catalog_sales
> JOIN catalog_returns ON catalog_sales.cs_item_sk =
> catalog_returns.cr_item_sk
> group by cs_item_sk
> having
> sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge +
> cr_store_credit)) cs_ui
> ON store_sales.ss_item_sk = cs_ui.cs_item_sk
> WHERE
> i_color in ('maroon' , 'burnished',
> 'dim',
> 'steel',
> 'navajo',
> 'chocolate')
> and i_current_price between 35 and 35 + 10
> and i_current_price between 35 + 1 and 35 + 15
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)