Hi

I met the same problem and you can take a look at this JIRA.
I also pull a request, but it does not get merged:
https://github.com/apache/calcite/pull/2253
This optimization I think is very common and useful, I'll continue this
work soon.

cc Julian

Regards!

Aron Tao


Priyendra Deshwal <[email protected]> 于2021年3月1日周一 下午4:52写道:

> That rule was already enabled for me. I also tried enabling a collection of
> other *_REDUCE_EXPRESSION rules but it did not help. Reading the underlying
> code, it suggests that these rules help with constant elimination but
> perhaps do not do the kind of common expression factoring that may be
> needed for this particular case.
>
>
> On Sun, Feb 28, 2021 at 11:32 PM Fan Liya <[email protected]> wrote:
>
> > Hi Priyendra,
> >
> > We have FilterReduceExpressionsRule which reduces filter conditions.
> > It performs the simplification based on
> org.apache.calcite.rex.RexSimplify.
> >
> > Best,
> > Liya Fan
> >
> > On Mon, Mar 1, 2021 at 1:45 PM Priyendra Deshwal <[email protected]>
> > wrote:
> >
> > > Hello friends,
> > >
> > > I am playing around with TPC-DS schema and playing with the following
> > > simplified variant of query41.
> > >
> > > select  distinct(i_product_name)
> > >  from item i1
> > >  where i_manufact_id between 738 and 738+40
> > >    and (select count(*) as item_cnt
> > >         from item
> > >         where (i_manufact = i1.i_manufact and i_category = 'Women') or
> > >                    (i_manufact = i1.i_manufact and i_category = 'Men'))
> > > 0
> > >  order by i_product_name
> > >  limit 100
> > >
> > > This results in the following optimized plan. Note that the join
> > condition
> > > (i_manufact = i1.i_manufact) is not clearly expressed in this query
> since
> > > it is repeated in both OR clauses of the inner query. This results in a
> > > nested loop join and even the filter on i_category is not pushed all
> the
> > > way down to the query.
> > >
> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > > 2.929266516286664E9, id = 408
> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 5.4675E7,
> cumulative
> > > cost = 2.929266416286664E9, id = 406
> > >     EnumerableAggregate(group=[{2}]): rowcount = 5.4675E7, cumulative
> > cost
> > > = 2.874591416286664E9, id = 404
> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> > proj#0..4=[{exprs}],
> > > $condition=[$t11]): rowcount = 5.4675E8, cumulative cost =
> > > 2.819916416286664E9, id = 410
> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > > rowcount = 2.187E9, cumulative cost = 2.273166416286664E9, id = 400
> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > > 9000.0, id = 386
> > >             BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 360
> > >           EnumerableAggregate(group=[{2}], ITEM_CNT=[COUNT()]):
> rowcount
> > =
> > > 810000.0, cumulative cost = 8.193105E7, id = 398
> > >             EnumerableNestedLoopJoin(condition=[OR(AND(=($1, $2), =($0,
> > > 'Women')), AND(=($1, $2), =($0, 'Men')))], joinType=[inner]): rowcount
> =
> > > 8100000.0, cumulative cost = 8.10198E7, id = 396
> > >               EnumerableInterpreter: rowcount = 18000.0, cumulative
> cost
> > =
> > > 9000.0, id = 389
> > >                 BindableTableScan(table=[[default, ITEM]],
> projects=[[12,
> > > 14]]): rowcount = 18000.0, cumulative cost = 29.999999999999996, id =
> 231
> > >               EnumerableAggregate(group=[{0}]): rowcount = 1800.0,
> > > cumulative cost = 10800.0, id = 394
> > >                 EnumerableInterpreter: rowcount = 18000.0, cumulative
> > cost
> > > = 9000.0, id = 392
> > >                   BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[14]]):
> > > rowcount = 18000.0, cumulative cost = 11.25, id = 364
> > >
> > > A simple rewrite of the query as follows where we "factor" out the join
> > > condition to the top level AND does make the plan significantly better
> > > (hash join with complete filter push down to the source).
> > >
> > > select  distinct(i_product_name)
> > >  from item i1
> > >  where i_manufact_id between 738 and 738+40
> > >    and (select count(*) as item_cnt
> > >         from item
> > >         where i_manufact = i1.i_manufact and
> > >                    (i_category = 'Women' or i_category = 'Men')) > 0
> > >  order by i_product_name
> > >  limit 100
> > >
> > > EnumerableLimit(fetch=[100]): rowcount = 100.0, cumulative cost =
> > > 6523491.28666381, id = 336
> > >   EnumerableSort(sort0=[$0], dir0=[ASC]): rowcount = 121500.0,
> cumulative
> > > cost = 6523391.28666381, id = 334
> > >     EnumerableAggregate(group=[{2}]): rowcount = 121500.0, cumulative
> > cost
> > > = 6401891.28666381, id = 332
> > >       EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)],
> > > expr#6=[0:BIGINT], expr#7=[0], expr#8=[>($t6, $t7)], expr#9=[AND($t5,
> > > $t8)], expr#10=[>($t4, $t7)], expr#11=[OR($t9, $t10)],
> > proj#0..4=[{exprs}],
> > > $condition=[$t11]): rowcount = 1215000.0, cumulative cost =
> > > 6280391.28666381, id = 338
> > >         EnumerableHashJoin(condition=[=($1, $3)], joinType=[left]):
> > > rowcount = 4860000.0, cumulative cost = 5065391.28666381, id = 328
> > >           EnumerableInterpreter: rowcount = 18000.0, cumulative cost =
> > > 9000.0, id = 321
> > >             BindableTableScan(table=[[default, ITEM]],
> > > filters=[[AND(>=($13, 738), <=($13, +(738, 40)))]], projects=[[13, 14,
> > > 21]]): rowcount = 18000.0, cumulative cost = 18.75, id = 298
> > >           EnumerableAggregate(group=[{0}], ITEM_CNT=[COUNT()]):
> rowcount
> > =
> > > 1800.0, cumulative cost = 11025.0, id = 326
> > >             EnumerableInterpreter: rowcount = 18000.0, cumulative cost
> =
> > > 9000.0, id = 324
> > >               BindableTableScan(table=[[default, ITEM]],
> > > filters=[[OR(=($12, 'Women'), =($12, 'Men'))]], projects=[[14]]):
> > rowcount
> > > = 18000.0, cumulative cost = 11.25, id = 302
> > >
> > > Is there any built-in Calcite rule that I can invoke which can do this
> > type
> > > of factorization of common where sub-conditions?
> > >
> > > Thanks!
> > >
> >
>

Reply via email to