Hello friends,
I am playing around with TPC-DS schema and playing with the following
simplified variant of query41.
select distinct(i_product_name)
from item i1
where i_manufact_id between 738 and 738+40
and (select count(*) as item_cnt
from item
where (i_manufact = i1.i_manufact and i_category = 'Women') or
(i_manufact = i1.i_manufact and i_category = 'Men')) > 0
order by i_product_name
limit 100
This results in the following optimized plan. Note that the join condition
(i_manufact = i1.i_manufact) is not clearly expressed in this query since
it is repeated in both OR clauses of the inner query. This results in a
nested loop join and even the filter on i_category is not pushed all the
way down to the query.
EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
2.929266516286664E9, id = 408
EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
cost = 2.929266416286664E9, id = 406
EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
= 2.874591416286664E9, id = 404
EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
2.819916416286664E9, id = 410
EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 386
BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
810000.0, cumulative cost = 8.193105E7, id = 398
EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
8100000.0, cumulative cost = 8.10198E7, id = 396
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 389
BindableTableScan(table=[[default, ITEM]], projects=[[12,
14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
cumulative cost = 10800.0, id = 394
EnumerableInterpreter: rowcount = 18000.0, cumulative cost
= 9000.0, id = 392
BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
rowcount = 18000.0, cumulative cost = 11.25, id = 364
A simple rewrite of the query as follows where we "factor" out the join
condition to the top level AND does make the plan significantly better
(hash join with complete filter push down to the source).
select distinct(i_product_name)
from item i1
where i_manufact_id between 738 and 738+40
and (select count(*) as item_cnt
from item
where i_manufact = i1.i_manufact and
(i_category = 'Women' or i_category = 'Men')) > 0
order by i_product_name
limit 100
EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
6523491.28666381, id = 336
EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
cost = 6523391.28666381, id = 334
EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
= 6401891.28666381, id = 332
EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 1215000.0, cumulative cost =
6280391.28666381, id = 338
EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 321
BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
1800.0, cumulative cost = 11025.0, id = 326
EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 324
BindableTableScan(table=[[default, ITEM]],
filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
= 18000.0, cumulative cost = 11.25, id = 302
Is there any built-in Calcite rule that I can invoke which can do this type
of factorization of common where sub-conditions?
Thanks!