Re: 回复: FIlterIntoJoinRule applied without complete result

Juri Petersen Mon, 31 Mar 2025 00:48:59 -0700

Hi,
I tried applying the following rules for my example query:

        final RuleSet wayangRules = RuleSets.ofList(
                CoreRules.FILTER_INTO_JOIN,
                CoreRules.MULTI_JOIN_OPTIMIZE_BUSHY,
                CoreRules.JOIN_COMMUTE,
                CoreRules.JOIN_ASSOCIATE
       );


The input tree looks like this:

LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], 
russian_movie=[MIN($1)])
  LogicalProject(name=[$1], title=[$31])
    LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), LIKE($11, 
'%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, 2005), =($30, $24), 
=($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, $25), =($21, $26))])
      LogicalJoin(condition=[true], joinType=[inner])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalJoin(condition=[true], joinType=[inner])
            LogicalJoin(condition=[true], joinType=[inner])
              LogicalJoin(condition=[true], joinType=[inner])
                LogicalJoin(condition=[true], joinType=[inner])
                  LogicalTableScan(table=[[postgres, char_name]])
                  LogicalTableScan(table=[[postgres, cast_info]])
                LogicalTableScan(table=[[postgres, company_name]])
              LogicalTableScan(table=[[postgres, company_type]])
            LogicalTableScan(table=[[postgres, movie_companies]])
          LogicalTableScan(table=[[postgres, role_type]])
        LogicalTableScan(table=[[postgres, title]])

The resulting converted tree is close to what I desire, however one 
multi-condition join can't be pushed down, leading to a tree with on cartesian 
product remaining:

WayangAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], 
russian_movie=[MIN($1)])
  WayangProject(name=[$1], title=[$31])
    WayangJoin(condition=[AND(=($24, $30), =($28, $13))], joinType=[inner])
      WayangJoin(condition=[=($9, $24)], joinType=[inner])
        WayangJoin(condition=[=($0, $10)], joinType=[inner])
          WayangTableScan(table=[[postgres, char_name]])
          WayangFilter(condition=[AND(LIKE($4, '%(voice)%'), LIKE($4, 
'%(uncredited)%'))])
            WayangTableScan(table=[[postgres, cast_info]])
        WayangJoin(condition=[=($0, $11)], joinType=[inner])
          WayangFilter(condition=[=($2, '[ru]')])
            WayangTableScan(table=[[postgres, company_name]])
          WayangJoin(condition=[=($0, $5)], joinType=[inner])
            WayangTableScan(table=[[postgres, company_type]])
            WayangTableScan(table=[[postgres, movie_companies]])
      WayangJoin(condition=[true], joinType=[inner])
        WayangFilter(condition=[=($1, 'actor')])
          WayangTableScan(table=[[postgres, role_type]])
        WayangFilter(condition=[>($4, 2005)])
          WayangTableScan(table=[[postgres, title]])


Looking at my input query, the role_type table has a specified join condition 
on cast_info.
Am I missing a detail that prevents me from being able to deconstruct the 
multi-conditional join here?

Any help would be greatly appreciated.

Best,
Juri

On 2025/03/27 10:57:26 Dong Silun wrote:
> Hi Juri,
> As Alessandro said, the Join order prevents the predicates from being pushed 
> down to the ideal position.
> You can try to use the two rules CoreRules.JOIN_COMMUTE and 
> CoreRules.JOIN_ASSOCIATE instead of the heuristic/dp join reorder algorithm. 
> In the case of all inner joins, CoreRules.JOIN_COMMUTE and 
> CoreRules.JOIN_ASSOCIATE will generate all join order possibilities (when 
> using VolcanoPlanner), so as to get the join order that can smoothly push all 
> predicates down to the ideal position (combined with the FilterIntoJoin rule).
> However, the optimization process may be time-consuming because there are a 
> total of 7 tables involved in join and the commutative and associative rules 
> are used to enumerate every possibility.
> I didn't actually run your example, I just provided an idea, I hope it can 
> help you.
> 
> Best,
> Silun
> 
> ________________________________
> 发件人: Juri Petersen <j...@apache.org>
> 发送时间: 2025年3月27日 16:40
> 收件人: dev@calcite.apache.org <dev@calcite.apache.org>
> 主题: Re: FIlterIntoJoinRule applied without complete result
> 
> Hi,
> Thank you for your answer!
> I see your point about join ordering, thats also why I tried using the 
> MULTI_JOIN_OPTIMZE CoreRule before.
> I tried it again just now, and these rules still don't resolve my problem:
> 
>         final RuleSet rules = RuleSets.ofList(
>                 CoreRules.FILTER_INTO_JOIN,
>                 CoreRules.MULTI_JOIN_OPTIMIZE
>         );
> 
> I tried both the smart and dumb FILTER_INTO_JOIN and also the bushy version 
> of MULTI_JOIN_OPTIMIZE.
> 
> The message I get when trying to optimize the plan is the following:
> 
> org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not 
> enough rules to produce a node with desired properties: convention=NONE. All 
> the inputs have relevant nodes, however the cost is still infinite.
> Root: rel#55:RelSubset#15.NONE
> Original rel:
> LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)], 
> russian_movie=[MIN($1)]): rowcount = 1.0, cumulative cost = 
> 1.0101010125097225E14, id = 30
>   LogicalProject(name=[$1], title=[$31]): rowcount = 120135.49804687499, 
> cumulative cost = 1.01010101250971E14, id = 29
>     LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), LIKE($11, 
> '%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, 2005), =($30, 
> $24), =($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, $25), =($21, 
> $26))]): rowcount = 120135.49804687499, cumulative cost = 
> 1.010101011308355E14, id = 26
>       LogicalJoin(condition=[true], joinType=[inner]): rowcount = 1.0E14, 
> cumulative cost = 1.010101010107E14, id = 25
>         LogicalJoin(condition=[true], joinType=[inner]): rowcount = 1.0E12, 
> cumulative cost = 1.0101010106E12, id = 21
>           LogicalJoin(condition=[true], joinType=[inner]): rowcount = 1.0E10, 
> cumulative cost = 1.01010105E10, id = 17
>             LogicalJoin(condition=[true], joinType=[inner]): rowcount = 
> 1.0E8, cumulative cost = 1.010104E8, id = 13
>               LogicalJoin(condition=[true], joinType=[inner]): rowcount = 
> 1000000.0, cumulative cost = 1010300.0, id = 9
>                 LogicalJoin(condition=[true], joinType=[inner]): rowcount = 
> 10000.0, cumulative cost = 10200.0, id = 5
>                   LogicalTableScan(table=[[postgres, char_name]]): rowcount = 
> 100.0, cumulative cost = 100.0, id = 1
>                   LogicalTableScan(table=[[postgres, cast_info]]): rowcount = 
> 100.0, cumulative cost = 100.0, id = 3
>                 LogicalTableScan(table=[[postgres, company_name]]): rowcount 
> = 100.0, cumulative cost = 100.0, id = 7
>               LogicalTableScan(table=[[postgres, company_type]]): rowcount = 
> 100.0, cumulative cost = 100.0, id = 11
>             LogicalTableScan(table=[[postgres, movie_companies]]): rowcount = 
> 100.0, cumulative cost = 100.0, id = 15
>           LogicalTableScan(table=[[postgres, role_type]]): rowcount = 100.0, 
> cumulative cost = 100.0, id = 19
>         LogicalTableScan(table=[[postgres, title]]): rowcount = 100.0, 
> cumulative cost = 100.0, id = 23
> 
> 
> I hope this specifies my problem a bit more.
> 
> Best,
> Juri
> 
> On 2025/03/26 13:54:33 Alessandro Solimando wrote:
> > Hi Juri,
> > it's true that the tables in the joins are fully connected via the
> > predicates, but order matters and the concrete order I see can't do without
> > cartesian products: it's joining "company_type" with other tables before
> > joining with "movie_companies", but the only predicate in the where clause
> > around "company_type" is "ct.id = mc.company_type_id", which can't be used
> > in that subtree as "movie_companies" hasn't been joined yet, so basically
> > it's a join ordering "issue" (which could not be an issue at all based on
> > the size of the tables, selectivity of the predicates etc.).
> >
> > Are you using rules for join ordering like LoptOptimizeJoinRule
> > <https://github.com/apache/calcite/blob/bfbe8930f4ed7ba8da530e862e212a057191cfa3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java>
> > in your program (the set of rules you use could help people provide a
> > better answer)? If you are using 1.39.0 there is a new join ordering
> > algorithm, you can refer to CALCITE-6846
> > <https://issues.apache.org/jira/browse/CALCITE-6846> and related PR for
> > more details which should be exhaustive.
> >
> > If you think you have added all the rules and you can't still get a sense
> > of why you end up with a particular plan, you can activate the extended
> > logs around rule applications and transformations to be able to then put
> > breakpoints in the involved rules at the specific step which is generally
> > tricky as rules are called multiple times. You can refer to these slides
> > https://www.slideshare.net/StamatisZampetakis/debugging-planning-issues-using-calcites-builtin-loggers
> > (there is also the full video and other links at
> > https://calcite.apache.org/community/, the talk is "Debugging planning
> > issues using Calcite’s built in loggers").
> >
> > Best regards,
> > Alessandro
> >
> > On Wed, 26 Mar 2025 at 11:10, Juri Petersen <j...@itu.dk.invalid> wrote:
> >
> > > Hi,
> > > As mentioned by Mads in a previous mail, we are working on a SQL-API in
> > > Apache Wayang.
> > > We are trying to set up experiments with the JOB Benchmark and see that we
> > > have to rewrite queries to explicit INNER JOINS for them to be parsed
> > > correctly.
> > > Since we are planning to do other benchmarks with thousands of queries,
> > > rewriting is not feasible.
> > >
> > > Given this (not-rewritten) query from JOB:
> > >
> > > SELECT MIN(chn.name) AS uncredited_voiced_character,
> > >        MIN(t.title) AS russian_movie
> > > FROM postgres.char_name AS chn,
> > >      postgres.cast_info AS ci,
> > >      postgres.company_name AS cn,
> > >      postgres.company_type AS ct,
> > >      postgres.movie_companies AS mc,
> > >      postgres.role_type AS rt,
> > >      postgres.title AS t
> > > WHERE ci.note LIKE '%(voice)%'
> > >   AND ci.note LIKE '%(uncredited)%'
> > >   AND cn.country_code = '[ru]'
> > >   AND rt.role = 'actor'
> > >   AND t.production_year > 2005
> > >   AND t.id = mc.movie_id
> > >   AND t.id = ci.movie_id
> > >   AND ci.movie_id = mc.movie_id
> > >   AND chn.id = ci.person_role_id
> > >   AND rt.id = ci.role_id
> > >   AND cn.id = mc.company_id
> > >   AND ct.id = mc.company_type_id;
> > >
> > > We use calcite to get the following tree:
> > >
> > > LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)],
> > > russian_movie=[MIN($1)])
> > >   LogicalProject(name=[$1], title=[$31])
> > >     LogicalFilter(condition=[AND(LIKE($11, '%(voice)%'), LIKE($11,
> > > '%(uncredited)%'), =($16, '[ru]'), =($29, 'actor'), >($34, 2005), =($30,
> > > $24), =($30, $9), =($9, $24), =($0, $10), =($28, $13), =($14, $25), =($21,
> > > $26))])
> > >       LogicalJoin(condition=[true], joinType=[inner])
> > >         LogicalJoin(condition=[true], joinType=[inner])
> > >           LogicalJoin(condition=[true], joinType=[inner])
> > >             LogicalJoin(condition=[true], joinType=[inner])
> > >               LogicalJoin(condition=[true], joinType=[inner])
> > >                 LogicalJoin(condition=[true], joinType=[inner])
> > >                   LogicalTableScan(table=[[postgres, char_name]])
> > >                   LogicalTableScan(table=[[postgres, cast_info]])
> > >                 LogicalTableScan(table=[[postgres, company_name]])
> > >               LogicalTableScan(table=[[postgres, company_type]])
> > >             LogicalTableScan(table=[[postgres, movie_companies]])
> > >           LogicalTableScan(table=[[postgres, role_type]])
> > >         LogicalTableScan(table=[[postgres, title]])
> > >
> > >
> > > I then try to apply the CoreRules.FILTER_INTO_JOIN (tried smart and dumb
> > > version), in order to avoid the cartesian products, hoping to push the 
> > > join
> > > conditions into the respective LogicalJoins.
> > > Heres the resulting tree:
> > >
> > > LogicalAggregate(group=[{}], uncredited_voiced_character=[MIN($0)],
> > > russian_movie=[MIN($1)])
> > >   LogicalProject(name=[$1], title=[$31])
> > >     LogicalJoin(condition=[=($24, $30)], joinType=[inner])
> > >       LogicalJoin(condition=[=($28, $13)], joinType=[inner])
> > >         LogicalJoin(condition=[AND(=($9, $24), =($14, $25), =($21, $26))],
> > > joinType=[inner])
> > >           LogicalJoin(condition=[true], joinType=[inner])
> > >             LogicalJoin(condition=[true], joinType=[inner])
> > >               LogicalJoin(condition=[=($0, $10)], joinType=[inner])
> > >                 LogicalTableScan(table=[[postgres, char_name]])
> > >                 LogicalFilter(condition=[AND(LIKE($4, '%(voice)%'),
> > > LIKE($4, '%(uncredited)%'))])
> > >                   LogicalTableScan(table=[[postgres, cast_info]])
> > >               LogicalFilter(condition=[=($2, '[ru]')])
> > >                 LogicalTableScan(table=[[postgres, company_name]])
> > >             LogicalTableScan(table=[[postgres, company_type]])
> > >           LogicalTableScan(table=[[postgres, movie_companies]])
> > >         LogicalFilter(condition=[=($1, 'actor')])
> > >           LogicalTableScan(table=[[postgres, role_type]])
> > >       LogicalFilter(condition=[>($4, 2005)])
> > >         LogicalTableScan(table=[[postgres, title]])
> > >
> > > Some of the conditions are pushed down, but we still have remaining
> > > cartesian products and a multi-condition join.
> > > Looking at the input query, I would expect every Join to have a condition,
> > > as there are no unspecified joins, right?
> > > What am I missing or what can we do to deconstruct the multi-conditional
> > > join and avoid cartesian products?
> > >
> > > Thanks in advance for any help!
> > >
> > > Best,
> > > Juri
> > >
> > >
> >
>

Re: 回复: FIlterIntoJoinRule applied without complete result

Reply via email to