Hello friends,

I am playing around with TPC-DS schema and playing with the following
simplified variant of query41.

select  distinct(i_product_name)
 from item i1
 where i_manufact_id between 738 and 738+40
   and (select count(*) as item_cnt
        from item
        where (i_manufact = i1.i_manufact and i_category = 'Women') or
                   (i_manufact = i1.i_manufact and i_category = 'Men')) > 0
 order by i_product_name
 limit 100

This results in the following optimized plan. Note that the join condition
(i_manufact = i1.i_manufact) is not clearly expressed in this query since
it is repeated in both OR clauses of the inner query. This results in a
nested loop join and even the filter on i_category is not pushed all the
way down to the query.

EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
2.929266516286664E9, id = 408
  EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
cost = 2.929266416286664E9, id = 406
    EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
= 2.874591416286664E9, id = 404
      EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
2.819916416286664E9, id = 410
        EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
          EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 386
            BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
          EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
810000.0, cumulative cost = 8.193105E7, id = 398
            EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
8100000.0, cumulative cost = 8.10198E7, id = 396
              EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 389
                BindableTableScan(table=[[default, ITEM]], projects=[[12,
14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
              EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
cumulative cost = 10800.0, id = 394
                EnumerableInterpreter: rowcount = 18000.0, cumulative cost
= 9000.0, id = 392
                  BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
rowcount = 18000.0, cumulative cost = 11.25, id = 364

A simple rewrite of the query as follows where we "factor" out the join
condition to the top level AND does make the plan significantly better
(hash join with complete filter push down to the source).

select  distinct(i_product_name)
 from item i1
 where i_manufact_id between 738 and 738+40
   and (select count(*) as item_cnt
        from item
        where i_manufact = i1.i_manufact and
                   (i_category = 'Women' or i_category = 'Men')) > 0
 order by i_product_name
 limit 100

EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
6523491.28666381, id = 336
  EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
cost = 6523391.28666381, id = 334
    EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
= 6401891.28666381, id = 332
      EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
$t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
$condition=[$t11]): rowcount = 1215000.0, cumulative cost =
6280391.28666381, id = 338
        EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
          EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 321
            BindableTableScan(table=[[default, ITEM]],
filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
          EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
1800.0, cumulative cost = 11025.0, id = 326
            EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
9000.0, id = 324
              BindableTableScan(table=[[default, ITEM]],
filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
= 18000.0, cumulative cost = 11.25, id = 302

Is there any built-in Calcite rule that I can invoke which can do this type
of factorization of common where sub-conditions?

Thanks!

Reply via email to