Jinpeng Wu created CALCITE-4920:
-----------------------------------

             Summary: Introduce logical space pruning to TopDownRuleDriver
                 Key: CALCITE-4920
                 URL: https://issues.apache.org/jira/browse/CALCITE-4920
             Project: Calcite
          Issue Type: Improvement
            Reporter: Jinpeng Wu
            Assignee: Jinpeng Wu


Last year, we submitted a PR introducing the TopDownRuleDriver. The rule 
driver implements the top-down search strategy suggested by the Cascades 
framework[1] and provides a basic branch-and-bound pruning mechanism, based on 
upper-bound and lower-bound costs, as suggested by the Columbia paper[2].

However, the previous version of TopDownRuleDriver can only prune 
implementation rules and enforcement rules, not transformation rules. The 
reason mainly concerns logical properties.
In the classic Volcano/Cascades model, logical properties, such as output row 
count, are bound to an equivalent set and never change during optimization. 
The Columbia optimizer[2] depends heavily on this premise. However, Calcite 
does not obey this rule: in Calcite, the logical properties of a RelSubset are 
likely to change during optimization. Actually, Calcite is not the only 
optimizer engine that suffers from this. Orca's logical properties of an 
equivalent set also change, and it cannot do logical pruning, either.

How does the logical properties problem prevent logical pruning? Take this plan 
as an example: sink <- op1 <- op2 <- scan.
By applying a transformation rule, op1 <- op2 is transformed into op3 <- op4. 
So we get a new alternative plan, say sink <- op3 <- op4 <- scan, in which op3 
is in the same equivalent set as op1.
After implementations and enforcements, the sub-plan (op1 <- op2 <- scan) gets 
fully optimized and yields a winner with cost C1.
Now we are going to optimize op3. We know that another plan in the same 
equivalent set has a cost of C1, so we can use C1 as a cost limit while 
optimizing op3. In the first step, we build op3 into a physical node, say 
impl-op3, and compute its self-cost as SC3.
Ideally, if SC3 is already greater than C1, we can decide that op3 will never 
be part of the best plan, so the optimization of op4 can be skipped. That is 
the basic thought of group pruning in the Columbia optimizer[2].
Here comes the problem: when we calculate the self-cost of impl-op3, we need to 
leverage the metadata, such as row count, of impl-op3, which in turn asks 
impl-op3's input to derive its own metadata. However, the equivalent set of op4 
is not yet fully explored, and its row count may not be the final one. So the 
self-cost of impl-op3 may be incorrect. If we apply group pruning according to 
such a cost, op4 will lose its opportunity to be explored, and hence its 
opportunity to become part of the best plan.
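The Columbia-style check described above can be sketched as follows. The class 
and method names are hypothetical, not Calcite APIs; the main method also shows 
why applying the check to an unstable SC3 is unsafe:

```java
// Hypothetical sketch, not Calcite code: the Columbia-style group-pruning
// check from the example above.
public final class GroupPruningSketch {

  // If the self-cost of a newly implemented node (SC3 for impl-op3) already
  // exceeds the cost limit inherited from a sibling winner (C1), the node's
  // inputs need not be optimized.
  static boolean canSkipInputs(double selfCost, double costLimit) {
    return selfCost > costLimit;
  }

  public static void main(String[] args) {
    double c1 = 100.0;          // winner cost of op1 <- op2 <- scan
    double sc3Unstable = 120.0; // SC3 from not-yet-final row counts of op4
    double sc3Final = 80.0;     // SC3 after op4's set is fully explored

    // Pruning on the unstable estimate discards op4's exploration...
    System.out.println(canSkipInputs(sc3Unstable, c1)); // true
    // ...even though the final cost says op3 should have been kept.
    System.out.println(canSkipInputs(sc3Final, c1));    // false
  }
}
```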

To ensure correctness, we require that all descendants are fully explored when 
calculating a node's cost. That's why our first version of TopDownRuleDriver 
only prunes implementation rules and enforcement rules.

In the past year, we tried several ways to solve the problem. For example, we 
tried to make Calcite's logical properties stable, as Xiening proposed. But 
the proposal was rejected, as changes to metadata after transformations are 
natural. We also tried to identify, by categories or annotations, rules that 
never change the logical properties, and to give up pruning for the other 
rules. But we failed again, because it introduced too much complexity for rule 
designers.

Those failures drove us to reconsider the problem from its very essence: if we 
cannot make SC3 stable, what if we give up using SC3 and leverage other costs 
for pruning?

Here is a simple description of the new thought. After achieving C1, we eagerly 
build op3 and op4, without further exploration of them. Because op4's input, 
the scan, was fully optimized during the optimization of op1, we can compute a 
stable cumulative cost for impl-op4. Let's denote it as C4. If we find that C4 
is already greater than C1, we know impl-op4 will never be part of the best 
plan, and some optimization steps can be skipped (to keep it simple, let 
impl-op4 be the only input of impl-op3):
1. The enforcement rules among impl-sink, impl-op3 and impl-op4, as well as 
trait pass-through. These steps were not handled properly in the previous 
version.
2. The trait derivation of impl-op4 and impl-op3.
3. The exploration of op3, if the substitutions produced by exploration always 
use op4 as input. This is the key to logical pruning; I will explain it in more 
detail later on.
Note that the exploration of op4 is not pruned, as we don't know whether op4's 
other alternatives would yield a lower cost. Moreover, the implementation of 
op3 is not skipped, as it has already been applied. But the implementations of 
other alternatives of op3 can be skipped if the exploration is pruned.
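As a rough illustration (again with made-up names, not the actual 
implementation), the bottom-up check replaces the unstable self-cost SC3 with 
the stable cumulative cost C4 of the eagerly built child:

```java
// Illustrative sketch, not Calcite code: prune using the cumulative cost of
// an eagerly built child whose own inputs are already winners.
public final class BottomUpPruningSketch {

  // C4 = impl-op4's self-cost plus the winner cost of its fully optimized
  // input (the scan). Stable, because no further exploration changes winners.
  static double cumulativeCost(double selfCost, double inputWinnerCost) {
    return selfCost + inputWinnerCost;
  }

  // If C4 already exceeds the cost limit C1, enforcement, trait derivation
  // and (conditionally) exploration on this branch can be skipped.
  static boolean pruneBranch(double c4, double c1) {
    return c4 > c1;
  }

  public static void main(String[] args) {
    double c1 = 100.0;        // winner cost of op1 <- op2 <- scan
    double scanWinner = 30.0; // scan was fully optimized under op1 already
    double c4 = cumulativeCost(75.0, scanWinner);
    System.out.println(pruneBranch(c4, c1)); // true: 105 > 100, skip steps
  }
}
```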

The new solution is a hybrid of top-down and bottom-up optimization. 
Optimization requests with cost limits are passed down in a top-down manner, 
while cost propagation and pruning take place in a bottom-up manner. It 
ensures that when a logical node is explored, it has already been built (if it 
can be) and all of its implementations' inputs are fully optimized. Besides 
better space pruning, this design brings some other benefits:
1. When deriving traits, knowledge about inputs is stable and concrete. Trait 
derivation is executed only once for each node (except for some corner cases). 
And the OMAKASE mode is fully supported.
2. No more cost-improvement propagation in subsets, and hence no more 
RelMetadataQuery cache invalidations.
3. Exploration rules and enforcement rules can use RelMetadataQuery without 
worrying about regressions: it is ensured that the input metadata is complete 
and stable.

Following the new thought, we implemented a new TopDownRuleDriver. Note that 
the new version is very different from the previous one, so the Calcite 
community may decide whether, and how, to merge it.

The assertion "if the substitutions always use op4 as input" is easier to 
state than to implement. For example, op3 can be transformed to op5 via an 
exploration rule, and op5 <- op4 can be transformed to op6 <- op7. In this 
case, op6 <- op7 no longer uses op4 as input, so the transformation from op3 
to op5 should take place. How do we identify such cases?
An ideal solution would be to look into the rule set, build the network of 
node transformations, determine that no rule will accept the pattern 
op5 <- op4, and then decide that the transformation of op3 to op5 can be 
skipped.
However, Calcite's interfaces do not allow the rule driver to know what kinds 
of nodes a transformation may produce without invoking the rule. That is, we 
cannot know in advance that the exploration of op3 will produce op5 or any 
other node. So we use a simpler but weaker solution: we only check whether 
there are rules that match op4 as a non-root operator (the RelOptRuleOperand's 
ordinalInRule is not 0). This solution is not perfect, of course. There could 
be further improvements.
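A simplified model of this check might look like the following; the types 
below are illustrative stand-ins, not Calcite's RelOptRule and 
RelOptRuleOperand classes:

```java
import java.util.List;

// Simplified model of the non-root-operand check; hypothetical types, not
// Calcite's rule classes.
public final class NonRootMatchSketch {

  record Operand(Class<?> nodeClass, int ordinalInRule) {}

  record Rule(List<Operand> operands) {}

  // Stand-in logical operator classes for the example.
  static class Project {}

  static class Filter {}

  // The parent's exploration may be pruned only if no rule can match the
  // input's node class as a non-root operand (ordinalInRule != 0).
  static boolean matchesAsNonRoot(List<Rule> rules, Class<?> inputClass) {
    for (Rule rule : rules) {
      for (Operand op : rule.operands()) {
        if (op.ordinalInRule() != 0 && op.nodeClass().isAssignableFrom(inputClass)) {
          return true; // some rule might rewrite a parent together with this input
        }
      }
    }
    return false; // safe: nodes produced above keep this input unchanged
  }

  public static void main(String[] args) {
    // A rule matching Project (root, ordinal 0) over Filter (child, ordinal 1).
    Rule rule = new Rule(List.of(
        new Operand(Project.class, 0), new Operand(Filter.class, 1)));
    System.out.println(matchesAsNonRoot(List.of(rule), Filter.class));  // true
    System.out.println(matchesAsNonRoot(List.of(rule), Project.class)); // false
  }
}
```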
One may doubt the correctness of the assumption that op5 always has op4 as its 
input if no rule matches the pattern xx <- op4. Exceptions do exist when a rule 
discards its inputs, such as the FilterReduceExpressions rule transforming the 
whole subtree into an empty Values when it finds that the condition is always 
false. Luckily, I have only seen substitution rules behave this way, and 
substitution rules are always applied.
Another consideration: although we cannot identify whether an exploration will 
trigger subsequent explorations, experience tells us that this is quite rare, 
especially when substitution rules have already been applied. So it is 
sometimes worthwhile to simply give up such explorations for the sake of 
performance. We therefore provide an option, aggressivePruning; when it is 
turned on, pruning no longer considers the subsequent effects of an 
exploration. Users may turn this feature on if they want better performance. 
Note that this option defaults to ON in the code, as we found few regressions 
during testing.

To make the pruning more efficient, we introduce another rule: when optimizing 
a set, the subset with the default distribution and default collation is 
optimized first. This rule does not introduce regressions, considering the 
fact that there is always a converter from the default subset to a concrete 
subset, and the default subset always has a cost no larger than the concrete 
subset (because all rels in a concrete subset also belong to the default 
subset). And it brings many benefits:
- When computing the lower bound for an exploration rule, we do not know which 
physical properties future implementations will require. The default subset 
thus provides a lower-bound cost.
- Logical nodes can be completely ignored while optimizing other subsets, as 
they must have already been optimized when optimizing the default subset.
- We can still use the root cost of a subtree (SC3 in the example) to prune 
enforcement rules, because when applying enforcement rules, we know that all 
inputs are fully optimized.
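A minimal sketch of this ordering rule, with made-up string trait 
representations in place of Calcite's RelTraitSet:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hedged sketch of the "default subset first" ordering; hypothetical types,
// not the driver's actual scheduling code.
public final class SubsetOrderSketch {

  record Subset(String distribution, String collation) {
    boolean isDefault() {
      // "any" distribution and empty collation stand in for the default traits.
      return "any".equals(distribution) && "[]".equals(collation);
    }
  }

  // Optimize the default subset first, so its winner is available as a lower
  // bound while the concrete subsets are optimized.
  static List<Subset> optimizationOrder(List<Subset> subsets) {
    List<Subset> ordered = new ArrayList<>(subsets);
    ordered.sort(Comparator.comparingInt(s -> s.isDefault() ? 0 : 1));
    return ordered;
  }

  public static void main(String[] args) {
    List<Subset> order = optimizationOrder(List.of(
        new Subset("hash[0]", "[0 ASC]"),
        new Subset("any", "[]")));
    System.out.println(order.get(0).isDefault()); // true: default comes first
  }
}
```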


Another interesting topic is introducing parallel optimization, as Orca does. 
The new driver introduces a TaskScheduler that manages tasks. Ideally, it 
could record the dependencies between tasks and schedule leaf tasks in 
parallel. However, the current version does not implement such a parallel 
scheduler, because most of VolcanoPlanner's data structures are not 
thread-safe. Community members who are interested in parallel optimization 
could help improve this.

[1] https://www.csd.uoc.gr/~hy460/pdf/CascadesFrameworkForQueryOptimization.pdf
[2] http://web.cecs.pdx.edu/~len/IDEAS01.pdf



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
