I investigated something similar a long time ago. We noticed that a
lot of trivial Project operators were being generated to compensate
for field re-ordering due to join transposition. And so the idea was
to allow each RelNode (and especially Join) to permute its output
fields.

Here is the case: https://issues.apache.org/jira/browse/CALCITE-62.
https://issues.apache.org/jira/browse/CALCITE-55 is related.

The problem, as I noted in CALCITE-62, is that people writing rules
have another mapping to deal with.

I believe that other systems, such as Spark's Catalyst planner, use
globally unique IDs for columns (as opposed to Calcite, whose column
references are only locally unique, ordinals of the input
operator(s)). Globally unique IDs would be superior for this problem
but perhaps introduce other challenges.

Julian

On Sun, Mar 7, 2021 at 11:25 PM Vladimir Ozerov <[email protected]> wrote:
>
> Hi,
>
> Currently, the order of attributes is used to define operators equivalence.
> This leads to several interesting problems, such as possible duplication of
> expressions (e.g., "a,a,b") or additional work in rules to detect trivial
> projects and/or input permutations (e.g. "a,b -> b,a").
>
> But the biggest problem is the join order planning. In Calcite, AxB is not
> equivalent to BxA:
>
>    1. It makes the ruleset [JoinCommuteRule, JoinAssociateRule]
>    insufficient to explore all bushy trees because the commute operation adds
>    a project on top of the new join (AxB -> project(BxA)), thus not
>    allowing for the associate rule to be executed on the upper join. The
>    solution is to add the ProjectJoinTransposeRule to the ruleset, but this
>    increases the planning time dramatically, making Apache Calcite unsuitable
>    for the cost-based join planning even for relatively simple join graphs.
>    2. It increases the number of join-related rule calls, which complicates
>    the implementation of the new join enumeration planning rules (e.g.
>    top-down enumerators) because duplicate derivations compromise performance.
>
> My question is - has the community considered an idea to make the order of
> columns a common property of all operators, somewhat similar to the trait,
> but without an explicit enforcer?
>
> For example, consider the following MEMO which the planner creates when
> trying to transform the join AxB to BxA:
>
> G1: { Scan1[table=t1, cols=a,b] }
> G2: { Scan2[table=t2, cols=c,d] }
> G3: { AxB[G1, G2], Project[G4, cols=$2,$3,$0,$1] }
> G4: { BxA[G2, G1] }
>
> However, if we make the column order part of the MEMO, we may potentially
> have something like this:
>
> G1: { Scan1[table=t1, cols=a,b] }
> G2: { Scan2[table=t2, cols=c,d] }
> G3 [cols=G1.$0, G1.$1, G2.$0, G2.$1]: { AxB[G1, G2], BxA[G2, G1] }
>
> Notice, how we were able to put AxB and BxA to the same equivalence group.
> To my knowledge, CockroachDB uses a somewhat similar design.
>
> I realize that this is rather a radical idea, and more design work is
> required to come with a proper proposal. At this point, I just would like
> to kindly ask the community to share high-level feedback on that. Were
> similar ideas proposed before?
>
> Thank you,
> Vladimir.

Reply via email to