Few quick thoughts about this.
For to the problem of query minimization/redundant joins the simpler
scenarios that I can think of are the following:
# Scenario A
select e1.id from emp e1 inner join emp e2 on e1.name = e2.name;
If you know the name column is UNIQUE then you can drop the join on e2.
# Scenario B
select e.name from emp e inner join dept d on e.dept = d.id;
If you know that e.dept and d.id is a foreign key relationship then
you can drop the join on dept.
There are probably other cases to define and handle but we should move
incrementally.
As Julian pointed out, the issue logged in CALITE-5631 could also be
addressed by employing common table expression related optimizations.
CTE optimizations and query minimization are both interesting and
powerful techniques to reduce the cost of the query (whatever that is
speed, money, resources, etc).
I would suggest focusing on query minimization first since it is
pretty well defined and we could come up with solutions much faster
than CTEs. CTEs usually come up with decisions about materializing or
not the common expressions which are closer to lower level ("physical
plan") optimizations.
Most minimization techniques focus on select project join (SPJ)
queries so I guess we would have to do some preprocessing to bring the
plan in this format (only Scan, Project, Filter, and Join operators)
before applying the rule. It would be a separate planning phase
combining a bunch of existing rules followed by some new which is
inline with what Julian was saying about bottom-up unification.
The new rule could be something similar to LoptOptimizeJoinRule that
operates on a MultiJoin. I haven't checked if the MultiJoin operator
is sufficient to express an SPJ query but I think the general idea of
grouping joins together seems to be a promising direction for writing
new rules.
Best,
Stamatis
On Sun, Apr 16, 2023 at 2:27 AM Julian Hyde <[email protected]> wrote:
>
> Ian Bertolacci recently logged
> https://issues.apache.org/jira/browse/CALCITE-5631, to convert
>
> select
> (select numarrayagg(C5633_203) from T893 where C5633_586 = T895.id),
> (select numarrayagg(C5633_170) from T893 where C5633_586 = T895.id)
> from T895
>
> into
>
> select agg.agg1,
> agg.agg2
> from T895
> left join (
> select C5633_586,
> numarrayagg(C5633_203) as agg1,
> numarrayagg(C5633_170) as agg2
> from T893
> where C5633_586 is not null
> group by C5633_586) as agg
> on agg.C5633_586 = T895.id
>
> This seems to me an interesting and important problem. But it's also a
> hard problem, and it's not clear to me which approach is the best.
> Does anyone have any ideas for how to approach it?
>
> Also, we could use more example queries that illustrate the general
> pattern. (Preferably in terms of simple databases such as EMP and
> DEPT.)
>
> In Calcite rewrite rules (RelRule) are usually the preferred approach.
> Because the common relational expressions scans can be an arbitrary
> distance apart in the RelNode tree, RelRule doesn't seem suitable.
>
> There seem to be some similarities to algorithms to use materialized
> views, which use bottom-up unification.
>
> Ian's original query actually has correlated scalar sub-queries rather
> than explicit joins. Would it be better to target common sub-queries
> rather than joins?
>
> Lastly, there are similarities with the WinMagic algorithm, which
> converts correlated sub-queries into window aggregates. Is that a
> useful direction? (My implementation of measures in CALCITE-4496
> naturally creates correlated scalar sub-queries that can be inlined in
> the enclosing query if simple, or converted to window aggregates if
> more complex.)
>
> Julian