Thanks Julian and Stamatis for sharing your thoughts on this optimization.
I have jotted down a general approach which could optimize the common
subexpression with join elimination use cases in the following document.
Please go through this
<https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing>
document and let me know what you think of this approach.

In gist the approach mentioned in this document is
1. Finding the common subexpressions
2. Using the subexpression information to eliminate the joins when possible.

I did think about implementing this optimization in the decorrelator as
well, but this approach was not explored further because it could miss
finding non correlated common sub expressions.

Let me know if you have any questions or suggestions on this approach.

Thanks,
Hanu




On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <[email protected]> wrote:

> Thank you, Stamatis. This is helpful. I have linked to this email thread
> in CALCITE-5631.
>
> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <[email protected]>
> wrote:
> >
> > Few quick thoughts about this.
> >
> > For to the problem of query minimization/redundant joins the simpler
> > scenarios that I can think of are the following:
> >
> > # Scenario A
> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name;
> >
> > If you know the name column is UNIQUE then you can drop the join on e2.
> >
> > # Scenario B
> > select e.name from emp e inner join dept d on e.dept = d.id;
> >
> > If you know that e.dept and d.id is a foreign key relationship then
> > you can drop the join on dept.
> >
> > There are probably other cases to define and handle but we should move
> > incrementally.
> >
> > As Julian pointed out, the issue logged  in CALITE-5631 could also be
> > addressed by employing common table expression related optimizations.
> > CTE optimizations and query minimization are both interesting and
> > powerful techniques to reduce the cost of the query (whatever that is
> > speed, money, resources, etc).
> >
> > I would suggest focusing on query minimization first since it is
> > pretty well defined and we could come up with solutions much faster
> > than CTEs. CTEs usually come up with decisions about materializing or
> > not the common expressions which are closer to lower level ("physical
> > plan") optimizations.
> >
> > Most minimization techniques focus on select project join (SPJ)
> > queries so I guess we would have to do some preprocessing to bring the
> > plan in this format (only Scan, Project, Filter, and Join operators)
> > before applying the rule. It would be a separate planning phase
> > combining a bunch of existing rules followed by some new which is
> > inline with what Julian was saying about bottom-up unification.
> >
> > The new rule could be something similar to LoptOptimizeJoinRule that
> > operates on a MultiJoin. I haven't checked if the MultiJoin operator
> > is sufficient to express an SPJ query but I think the general idea of
> > grouping joins together seems to be a promising direction for writing
> > new rules.
> >
> > Best,
> > Stamatis
> >
> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <[email protected]> wrote:
> >>
> >> Ian Bertolacci recently logged
> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert
> >>
> >>  select
> >>     (select numarrayagg(C5633_203) from T893 where C5633_586 = T895.id),
> >>     (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id)
> >>  from T895
> >>
> >> into
> >>
> >>  select agg.agg1,
> >>      agg.agg2
> >>  from T895
> >>  left join (
> >>    select C5633_586,
> >>        numarrayagg(C5633_203) as agg1,
> >>        numarrayagg(C5633_170) as agg2
> >>    from T893
> >>    where C5633_586 is not null
> >>    group by C5633_586) as agg
> >>  on agg.C5633_586 = T895.id
> >>
> >> This seems to me an interesting and important problem. But it's also a
> >> hard problem, and it's not clear to me which approach is the best.
> >> Does anyone have any ideas for how to approach it?
> >>
> >> Also, we could use more example queries that illustrate the general
> >> pattern.  (Preferably in terms of simple databases such as EMP and
> >> DEPT.)
> >>
> >> In Calcite rewrite rules (RelRule) are usually the preferred approach.
> >> Because the common relational expressions scans can be an arbitrary
> >> distance apart in the RelNode tree, RelRule doesn't seem suitable.
> >>
> >> There seem to be some similarities to algorithms to use materialized
> >> views, which use bottom-up unification.
> >>
> >> Ian's original query actually has correlated scalar sub-queries rather
> >> than explicit joins. Would it be better to target common sub-queries
> >> rather than joins?
> >>
> >> Lastly, there are similarities with the WinMagic algorithm, which
> >> converts correlated sub-queries into window aggregates. Is that a
> >> useful direction? (My implementation of measures in CALCITE-4496
> >> naturally creates correlated scalar sub-queries that can be inlined in
> >> the enclosing query if simple, or converted to window aggregates if
> >> more complex.)
> >>
> >> Julian
>
>

Reply via email to