Re: Rewrite rule to convert self-joins into scans

Stamatis Zampetakis Mon, 01 May 2023 14:25:07 -0700

Hi Hanumath,

Thanks for putting this doc together; the examples are helpful and I
find the approach interesting.


CTEs and redundant joins have similarities but I see them as different
problems. CTEs are patterns that appear multiple times in a query but
are not necessarily redundant.

Apart from identifying CTEs I am actually pretty interested in how to
represent them. It may be helpful to add some concrete examples/plans
covering the representation part in the document you shared.

A few weeks ago, I created a doc [1] with some preliminary ideas about
CTEs in the context of Hive covering briefly:
* representation
* identification
* logical -> physical for Hive

It would be great if we can converge in a representation that is
useful for multiple projects and built on top of that. The
identification part is somewhat pluggable and it could be different
depending on the use-case.

Best,
Stamatis

[1] 
https://github.com/zabetak/docs/blob/main/2023-03-09-Hive-cbo-cte-optimization/index.md

On Sun, Apr 30, 2023 at 10:08 PM Hanumath Maduri
<[email protected]> wrote:
>
> Dear Calcite Devs,
>
> I would like to start prototyping the approach for the optimization
> mentioned in (CALCITE-5631
> <https://issues.apache.org/jira/browse/CALCITE-5631>). So I just want to
> follow up on the approach mentioned in the above document.
> Please let me know if this approach seems reasonable to all of you.
>
> @Stamatis and @Julian,  Please let me know if you got a chance to take a
> look at the document and are in-line with the approach.
> Any suggestions/clarifications on this approach would be very helpful.
>
> Thanks,
>
>
> On Mon, Apr 24, 2023 at 7:25 PM Hanumath Maduri <[email protected]>
> wrote:
>
> >
> > Thanks Julian and Stamatis for sharing your thoughts on this optimization.
> > I have jotted down a general approach which could optimize the common
> > subexpression with join elimination use cases in the following document.
> > Please go through this
> > <https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing>
> > document and let me know what you think of this approach.
> >
> > In gist the approach mentioned in this document is
> > 1. Finding the common subexpressions
> > 2. Using the subexpression information to eliminate the joins when
> > possible.
> >
> > I did think about implementing this optimization in the decorrelator as
> > well, but this approach was not explored further because it could miss
> > finding non correlated common sub expressions.
> >
> > Let me know if you have any questions or suggestions on this approach.
> >
> > Thanks,
> > Hanu
> >
> >
> >
> >
> > On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <[email protected]>
> > wrote:
> >
> >> Thank you, Stamatis. This is helpful. I have linked to this email thread
> >> in CALCITE-5631.
> >>
> >> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <[email protected]>
> >> wrote:
> >> >
> >> > Few quick thoughts about this.
> >> >
> >> > For to the problem of query minimization/redundant joins the simpler
> >> > scenarios that I can think of are the following:
> >> >
> >> > # Scenario A
> >> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name;
> >> >
> >> > If you know the name column is UNIQUE then you can drop the join on e2.
> >> >
> >> > # Scenario B
> >> > select e.name from emp e inner join dept d on e.dept = d.id;
> >> >
> >> > If you know that e.dept and d.id is a foreign key relationship then
> >> > you can drop the join on dept.
> >> >
> >> > There are probably other cases to define and handle but we should move
> >> > incrementally.
> >> >
> >> > As Julian pointed out, the issue logged  in CALITE-5631 could also be
> >> > addressed by employing common table expression related optimizations.
> >> > CTE optimizations and query minimization are both interesting and
> >> > powerful techniques to reduce the cost of the query (whatever that is
> >> > speed, money, resources, etc).
> >> >
> >> > I would suggest focusing on query minimization first since it is
> >> > pretty well defined and we could come up with solutions much faster
> >> > than CTEs. CTEs usually come up with decisions about materializing or
> >> > not the common expressions which are closer to lower level ("physical
> >> > plan") optimizations.
> >> >
> >> > Most minimization techniques focus on select project join (SPJ)
> >> > queries so I guess we would have to do some preprocessing to bring the
> >> > plan in this format (only Scan, Project, Filter, and Join operators)
> >> > before applying the rule. It would be a separate planning phase
> >> > combining a bunch of existing rules followed by some new which is
> >> > inline with what Julian was saying about bottom-up unification.
> >> >
> >> > The new rule could be something similar to LoptOptimizeJoinRule that
> >> > operates on a MultiJoin. I haven't checked if the MultiJoin operator
> >> > is sufficient to express an SPJ query but I think the general idea of
> >> > grouping joins together seems to be a promising direction for writing
> >> > new rules.
> >> >
> >> > Best,
> >> > Stamatis
> >> >
> >> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <[email protected]> wrote:
> >> >>
> >> >> Ian Bertolacci recently logged
> >> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert
> >> >>
> >> >>  select
> >> >>     (select numarrayagg(C5633_203) from T893 where C5633_586 =
> >> T895.id),
> >> >>     (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id)
> >> >>  from T895
> >> >>
> >> >> into
> >> >>
> >> >>  select agg.agg1,
> >> >>      agg.agg2
> >> >>  from T895
> >> >>  left join (
> >> >>    select C5633_586,
> >> >>        numarrayagg(C5633_203) as agg1,
> >> >>        numarrayagg(C5633_170) as agg2
> >> >>    from T893
> >> >>    where C5633_586 is not null
> >> >>    group by C5633_586) as agg
> >> >>  on agg.C5633_586 = T895.id
> >> >>
> >> >> This seems to me an interesting and important problem. But it's also a
> >> >> hard problem, and it's not clear to me which approach is the best.
> >> >> Does anyone have any ideas for how to approach it?
> >> >>
> >> >> Also, we could use more example queries that illustrate the general
> >> >> pattern.  (Preferably in terms of simple databases such as EMP and
> >> >> DEPT.)
> >> >>
> >> >> In Calcite rewrite rules (RelRule) are usually the preferred approach.
> >> >> Because the common relational expressions scans can be an arbitrary
> >> >> distance apart in the RelNode tree, RelRule doesn't seem suitable.
> >> >>
> >> >> There seem to be some similarities to algorithms to use materialized
> >> >> views, which use bottom-up unification.
> >> >>
> >> >> Ian's original query actually has correlated scalar sub-queries rather
> >> >> than explicit joins. Would it be better to target common sub-queries
> >> >> rather than joins?
> >> >>
> >> >> Lastly, there are similarities with the WinMagic algorithm, which
> >> >> converts correlated sub-queries into window aggregates. Is that a
> >> >> useful direction? (My implementation of measures in CALCITE-4496
> >> >> naturally creates correlated scalar sub-queries that can be inlined in
> >> >> the enclosing query if simple, or converted to window aggregates if
> >> >> more complex.)
> >> >>
> >> >> Julian
> >>
> >>

Re: Rewrite rule to convert self-joins into scans

Reply via email to