Hi Priyendra,

We have FilterReduceExpressionsRule which reduces filter conditions.
It performs the simplification based on org.apache.calcite.rex.RexSimplify.

Best,
Liya Fan

On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <[email protected]>
wrote:

> Hello friends,
>
> I am playing around with TPC-DS schema and playing with the following
> simplified variant of query41.
>
> select  distinct(i_product_name)
>  from item i1
>  where i_manufact_id between 738 and 738+40
>    and (select count(*) as item_cnt
>         from item
>         where (i_manufact = i1.i_manufact and i_category = 'Women') or
>                    (i_manufact = i1.i_manufact and i_category = 'Men')) > 0
>  order by i_product_name
>  limit 100
>
> This results in the following optimized plan. Note that the join condition
> (i_manufact = i1.i_manufact) is not clearly expressed in this query since
> it is repeated in both OR clauses of the inner query. This results in a
> nested loop join and even the filter on i_category is not pushed all the
> way down to the query.
>
> EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> 2.929266516286664E9, id = 408
>   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7, cumulative
> cost = 2.929266416286664E9, id = 406
>     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative cost
> = 2.874591416286664E9, id = 404
>       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
> $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> 2.819916416286664E9, id = 410
>         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
>           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 386
>             BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
>           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]): rowcount =
> 810000.0, cumulative cost = 8.193105E7, id = 398
>             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount =
> 8100000.0, cumulative cost = 8.10198E7, id = 396
>               EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 389
>                 BindableTableScan(table=[[default, ITEM]], projects=[[12,
> 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id = 231
>               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> cumulative cost = 10800.0, id = 394
>                 EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> = 9000.0, id = 392
>                   BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> rowcount = 18000.0, cumulative cost = 11.25, id = 364
>
> A simple rewrite of the query as follows where we "factor" out the join
> condition to the top level AND does make the plan significantly better
> (hash join with complete filter push down to the source).
>
> select  distinct(i_product_name)
>  from item i1
>  where i_manufact_id between 738 and 738+40
>    and (select count(*) as item_cnt
>         from item
>         where i_manufact = i1.i_manufact and
>                    (i_category = 'Women' or i_category = 'Men')) > 0
>  order by i_product_name
>  limit 100
>
> EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> 6523491.28666381, id = 336
>   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0, cumulative
> cost = 6523391.28666381, id = 334
>     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative cost
> = 6401891.28666381, id = 332
>       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)], proj#0..4=[{exprs}],
> $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> 6280391.28666381, id = 338
>         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
>           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 321
>             BindableTableScan(table=[[default, ITEM]],
> filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
>           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]): rowcount =
> 1800.0, cumulative cost = 11025.0, id = 326
>             EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> 9000.0, id = 324
>               BindableTableScan(table=[[default, ITEM]],
> filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]): rowcount
> = 18000.0, cumulative cost = 11.25, id = 302
>
> Is there any built-in Calcite rule that I can invoke which can do this type
> of factorization of common where sub-conditions?
>
> Thanks!
>

Reply via email to