Hi Hanumath, Thanks for putting this doc together; the examples are helpful and I find the approach interesting.
CTEs and redundant joins have similarities but I see them as different problems. CTEs are patterns that appear multiple times in a query but are not necessarily redundant. Apart from identifying CTEs I am actually pretty interested in how to represent them. It may be helpful to add some concrete examples/plans covering the representation part in the document you shared. A few weeks ago, I created a doc [1] with some preliminary ideas about CTEs in the context of Hive covering briefly: * representation * identification * logical -> physical for Hive It would be great if we can converge in a representation that is useful for multiple projects and built on top of that. The identification part is somewhat pluggable and it could be different depending on the use-case. Best, Stamatis [1] https://github.com/zabetak/docs/blob/main/2023-03-09-Hive-cbo-cte-optimization/index.md On Sun, Apr 30, 2023 at 10:08 PM Hanumath Maduri <[email protected]> wrote: > > Dear Calcite Devs, > > I would like to start prototyping the approach for the optimization > mentioned in (CALCITE-5631 > <https://issues.apache.org/jira/browse/CALCITE-5631>). So I just want to > follow up on the approach mentioned in the above document. > Please let me know if this approach seems reasonable to all of you. > > @Stamatis and @Julian, Please let me know if you got a chance to take a > look at the document and are in-line with the approach. > Any suggestions/clarifications on this approach would be very helpful. > > Thanks, > > > On Mon, Apr 24, 2023 at 7:25 PM Hanumath Maduri <[email protected]> > wrote: > > > > > Thanks Julian and Stamatis for sharing your thoughts on this optimization. > > I have jotted down a general approach which could optimize the common > > subexpression with join elimination use cases in the following document. > > Please go through this > > <https://docs.google.com/document/d/1rF6sZOipfXe2UdBViSbIF33FMXyz5Ev_s-q4IMe8gEI/edit?usp=sharing> > > document and let me know what you think of this approach. > > > > In gist the approach mentioned in this document is > > 1. Finding the common subexpressions > > 2. Using the subexpression information to eliminate the joins when > > possible. > > > > I did think about implementing this optimization in the decorrelator as > > well, but this approach was not explored further because it could miss > > finding non correlated common sub expressions. > > > > Let me know if you have any questions or suggestions on this approach. > > > > Thanks, > > Hanu > > > > > > > > > > On Wed, Apr 19, 2023 at 11:15 AM Julian Hyde <[email protected]> > > wrote: > > > >> Thank you, Stamatis. This is helpful. I have linked to this email thread > >> in CALCITE-5631. > >> > >> > On Apr 16, 2023, at 2:44 AM, Stamatis Zampetakis <[email protected]> > >> wrote: > >> > > >> > Few quick thoughts about this. > >> > > >> > For to the problem of query minimization/redundant joins the simpler > >> > scenarios that I can think of are the following: > >> > > >> > # Scenario A > >> > select e1.id from emp e1 inner join emp e2 on e1.name = e2.name; > >> > > >> > If you know the name column is UNIQUE then you can drop the join on e2. > >> > > >> > # Scenario B > >> > select e.name from emp e inner join dept d on e.dept = d.id; > >> > > >> > If you know that e.dept and d.id is a foreign key relationship then > >> > you can drop the join on dept. > >> > > >> > There are probably other cases to define and handle but we should move > >> > incrementally. > >> > > >> > As Julian pointed out, the issue logged in CALITE-5631 could also be > >> > addressed by employing common table expression related optimizations. > >> > CTE optimizations and query minimization are both interesting and > >> > powerful techniques to reduce the cost of the query (whatever that is > >> > speed, money, resources, etc). > >> > > >> > I would suggest focusing on query minimization first since it is > >> > pretty well defined and we could come up with solutions much faster > >> > than CTEs. CTEs usually come up with decisions about materializing or > >> > not the common expressions which are closer to lower level ("physical > >> > plan") optimizations. > >> > > >> > Most minimization techniques focus on select project join (SPJ) > >> > queries so I guess we would have to do some preprocessing to bring the > >> > plan in this format (only Scan, Project, Filter, and Join operators) > >> > before applying the rule. It would be a separate planning phase > >> > combining a bunch of existing rules followed by some new which is > >> > inline with what Julian was saying about bottom-up unification. > >> > > >> > The new rule could be something similar to LoptOptimizeJoinRule that > >> > operates on a MultiJoin. I haven't checked if the MultiJoin operator > >> > is sufficient to express an SPJ query but I think the general idea of > >> > grouping joins together seems to be a promising direction for writing > >> > new rules. > >> > > >> > Best, > >> > Stamatis > >> > > >> > On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <[email protected]> wrote: > >> >> > >> >> Ian Bertolacci recently logged > >> >> https://issues.apache.org/jira/browse/CALCITE-5631, to convert > >> >> > >> >> select > >> >> (select numarrayagg(C5633_203) from T893 where C5633_586 = > >> T895.id), > >> >> (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id) > >> >> from T895 > >> >> > >> >> into > >> >> > >> >> select agg.agg1, > >> >> agg.agg2 > >> >> from T895 > >> >> left join ( > >> >> select C5633_586, > >> >> numarrayagg(C5633_203) as agg1, > >> >> numarrayagg(C5633_170) as agg2 > >> >> from T893 > >> >> where C5633_586 is not null > >> >> group by C5633_586) as agg > >> >> on agg.C5633_586 = T895.id > >> >> > >> >> This seems to me an interesting and important problem. But it's also a > >> >> hard problem, and it's not clear to me which approach is the best. > >> >> Does anyone have any ideas for how to approach it? > >> >> > >> >> Also, we could use more example queries that illustrate the general > >> >> pattern. (Preferably in terms of simple databases such as EMP and > >> >> DEPT.) > >> >> > >> >> In Calcite rewrite rules (RelRule) are usually the preferred approach. > >> >> Because the common relational expressions scans can be an arbitrary > >> >> distance apart in the RelNode tree, RelRule doesn't seem suitable. > >> >> > >> >> There seem to be some similarities to algorithms to use materialized > >> >> views, which use bottom-up unification. > >> >> > >> >> Ian's original query actually has correlated scalar sub-queries rather > >> >> than explicit joins. Would it be better to target common sub-queries > >> >> rather than joins? > >> >> > >> >> Lastly, there are similarities with the WinMagic algorithm, which > >> >> converts correlated sub-queries into window aggregates. Is that a > >> >> useful direction? (My implementation of measures in CALCITE-4496 > >> >> naturally creates correlated scalar sub-queries that can be inlined in > >> >> the enclosing query if simple, or converted to window aggregates if > >> >> more complex.) > >> >> > >> >> Julian > >> > >>
