Hi Priyendra, We have FilterReduceExpressionsRule which reduces filter conditions. It performs the simplification based on org.apache.calcite.rex.RexSimplify.
Best, Liya Fan On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <[email protected]> wrote: > Hello friends, > > I am playing around with TPC-DS schema and playing with the following > simplified variant of query41. > > select distinct(i_product_name) > from item i1 > where i_manufact_id between 738 and 738+40 > and (select count(*) as item_cnt > from item > where (i_manufact = i1.i_manufact and i_category = 'Women') or > (i_manufact = i1.i_manufact and i_category = 'Men')) > 0 > order by i_product_name > limit 100 > > This results in the following optimized plan. Note that the join condition > (i_manufact = i1.i_manufact) is not clearly expressed in this query since > it is repeated in both OR clauses of the inner query. This results in a > nested loop join and even the filter on i_category is not pushed all the > way down to the query. > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost = > 2.929266516286664E9, id = 408 > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative > cost = 2.929266416286664E9, id = 406 > EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost > = 2.874591416286664E9, id = 404 > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)], > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5, > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}], > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost = > 2.819916416286664E9, id = 410 > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]): > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400 > EnumerableInterpreter: rowcount = 18000.0, cumulative cost = > 9000.0, id = 386 > BindableTableScan(table=[[default, ITEM]], > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14, > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360 > EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount = > 810000.0, cumulative cost = 8.193105E7, id = 398 > EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0, > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount = > 8100000.0, cumulative cost = 8.10198E7, id = 396 > EnumerableInterpreter: rowcount = 18000.0, cumulative cost = > 9000.0, id = 389 > BindableTableScan(table=[[default, ITEM]], projects=[[12, > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231 > EnumerableAggregate(group=[{0}]): rowcount = 1800.0, > cumulative cost = 10800.0, id = 394 > EnumerableInterpreter: rowcount = 18000.0, cumulative cost > = 9000.0, id = 392 > BindableTableScan(table=[[default, ITEM]], > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]): > rowcount = 18000.0, cumulative cost = 11.25, id = 364 > > A simple rewrite of the query as follows where we "factor" out the join > condition to the top level AND does make the plan significantly better > (hash join with complete filter push down to the source). > > select distinct(i_product_name) > from item i1 > where i_manufact_id between 738 and 738+40 > and (select count(*) as item_cnt > from item > where i_manufact = i1.i_manufact and > (i_category = 'Women' or i_category = 'Men')) > 0 > order by i_product_name > limit 100 > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost = > 6523491.28666381, id = 336 > EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative > cost = 6523391.28666381, id = 334 > EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost > = 6401891.28666381, id = 332 > EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)], > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5, > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}], > $condition=[$t11]): rowcount = 1215000.0, cumulative cost = > 6280391.28666381, id = 338 > EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]): > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328 > EnumerableInterpreter: rowcount = 18000.0, cumulative cost = > 9000.0, id = 321 > BindableTableScan(table=[[default, ITEM]], > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14, > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298 > EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount = > 1800.0, cumulative cost = 11025.0, id = 326 > EnumerableInterpreter: rowcount = 18000.0, cumulative cost = > 9000.0, id = 324 > BindableTableScan(table=[[default, ITEM]], > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount > = 18000.0, cumulative cost = 11.25, id = 302 > > Is there any built-in Calcite rule that I can invoke which can do this type > of factorization of common where sub-conditions? > > Thanks! >
