Jinpeng Wu created CALCITE-4920:
-----------------------------------
Summary: Introduce logical space pruning to TopDownRuleDriver
Key: CALCITE-4920
URL: https://issues.apache.org/jira/browse/CALCITE-4920
Project: Calcite
Issue Type: Improvement
Reporter: Jinpeng Wu
Assignee: Jinpeng Wu
Last year, we submitted a PR introducing the TopDownRuleDriver. The rule driver
implements the top-down search strategy suggested by the Cascades
framework[1] and provides a basic branch-and-bound pruning mechanism, based on
the upper-bound and lower-bound costs suggested by the Columbia paper[2].
However, the previous version of TopDownRuleDriver could only prune
implementation rules and enforcement rules, not transformation rules. The main
reason concerns logical properties.
In the classic Volcano/Cascades model, logical properties, such as output row
count, are bound to an equivalence set and never change during optimization.
The Columbia optimizer[2] depends heavily on this premise.
However, Calcite does not obey this rule: in Calcite, the logical properties of
a RelSubset are likely to change during optimization. Calcite is not the only
optimizer engine that suffers from this. Orca's logical properties of an
equivalence set also change, and it cannot perform logical pruning either.
How does the logical properties problem prevent logical pruning? Take this plan
as an example: sink <- op1 <- op2 <- scan.
By applying a transformation rule, op1 <- op2 is transformed into op3 <- op4.
So we get a new alternative plan, say sink <- op3 <- op4 <- scan, in which op3
is in the same equivalence set as op1.
After implementation and enforcement, the sub-plan (op1 <- op2 <- scan) gets
fully optimized and yields a winner with cost C1.
Now we are going to optimize op3. We know another plan in the same equivalence
set has a cost of C1, so we can use C1 as a cost limit while optimizing op3. In
the first step, we build op3 into a physical node, say impl-op3, and compute
its self-cost, SC3.
Ideally, if SC3 is already greater than C1, then we can conclude that op3 will
never be part of the best plan, and the optimization of op4 can be skipped.
That is the basic idea of group pruning in the Columbia optimizer[2].
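The idealized check described above can be sketched as follows. This is a hypothetical illustration of Columbia-style branch-and-bound, not Calcite code: costs are plain doubles here, whereas Calcite's real model uses RelOptCost, and all method names are made up.

```java
// Hypothetical sketch of Columbia-style group pruning. All names are
// illustrative; Calcite's real cost model uses RelOptCost, not double.
public class GroupPruningSketch {

    /**
     * Decide whether optimization of a node's inputs can be skipped.
     *
     * @param selfCost         self-cost of the physical node (SC3 for impl-op3)
     * @param inputsLowerBound a lower bound on the inputs' cost (>= 0)
     * @param upperBound       cost of the best plan found so far (C1)
     */
    static boolean canPruneInputs(double selfCost, double inputsLowerBound,
                                  double upperBound) {
        // Even in the best case, the plan rooted at this node costs at
        // least selfCost + inputsLowerBound; if that already exceeds the
        // winner's cost, optimizing the inputs further is wasted work.
        return selfCost + inputsLowerBound > upperBound;
    }

    public static void main(String[] args) {
        double c1 = 100.0;   // cost of the fully optimized winner
        double sc3 = 120.0;  // self-cost of impl-op3
        // SC3 > C1, so the optimization of op4 could be skipped, even
        // assuming op4 could be implemented for free.
        System.out.println(canPruneInputs(sc3, 0.0, c1)); // true
    }
}
```

The rest of this issue explains why this check, as stated, is unsafe in Calcite: SC3 is not stable.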
Here comes the problem: when we calculate the self-cost of impl-op3, we need to
leverage the metadata of impl-op3, such as its row count, which in turn asks
impl-op3's input to derive its own metadata. However, the equivalence set of
op4 is not yet fully explored, and its row count may not be final. So the
self-cost of impl-op3 may be incorrect. If we apply group pruning based on such
a cost, op4 loses its opportunity to be explored, and with it the opportunity
to become part of the best plan.
To ensure correctness, we require that all descendants are fully explored when
calculating a node's cost. That is why our first version of TopDownRuleDriver
only prunes implementation rules and enforcement rules.
Over the past year, we tried several ways to solve the problem. For example,
we tried to make Calcite's logical properties stable, as Xiening proposed. But
the proposal was rejected, as changes of metadata after transformations are
natural. We also tried to identify, by categories or annotations, rules that
never change the logical properties, and to give up pruning for the other
rules. But that also failed, because it introduced too much complexity for rule
designers.
Those failures drove us to reconsider the problem from its essence: if we
cannot make SC3 stable, what if we give up the use of SC3 and leverage
other costs for pruning?
Here is a simple description of the new approach. After obtaining C1, we
eagerly build op3 and op4, without further exploration of them. Because op4's
input, the scan, was fully optimized during the optimization of op1, we can
compute a stable cumulative cost for impl-op4; let's denote it C4. If C4 is
already greater than C1, then we know impl-op4 will never be part of the best
plan, and some optimization steps can be skipped (to keep it simple, let
impl-op4 be the only input of impl-op3):
1. The enforcement rules among impl-sink, impl-op3 and impl-op4, as well as
trait pass-through. These steps were not handled properly in the previous
version.
2. The trait derivation of impl-op4 and impl-op3.
3. The exploration of op3, provided that the substitutions produced by
exploration always use op4 as an input. This is the key to logical pruning; I
will explain it in more detail later on.
Note that the exploration of op4 is not pruned, as we don't know whether op4's
other alternatives would yield a lower cost. Moreover, the implementation of
op3 is not skipped, as it has already been applied. But the implementations of
op3's other alternatives can be skipped if the exploration is pruned.
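The bottom-up check described above can be sketched like this. It is a simplified stand-in under the stated assumptions (impl-op4 is impl-op3's only input, and op4's input scan is fully optimized); the classes and numbers are illustrative, not Calcite's RelNode/RelSubset machinery.

```java
// Hypothetical sketch of the bottom-up cumulative-cost pruning.
// Because the scan is fully optimized, impl-op4's cumulative cost C4 is
// stable even though op4's equivalence set is not yet fully explored.
public class BottomUpPruningSketch {

    /** Cumulative cost = node's self-cost plus its input's cumulative cost. */
    static double cumulativeCost(double selfCost, double inputCumulativeCost) {
        return selfCost + inputCumulativeCost;
    }

    /** True when further steps on impl-op4 / impl-op3 can be skipped. */
    static boolean canSkip(double c4, double c1) {
        // C4 already exceeds the winner's cost C1: impl-op4 can never be
        // part of the best plan, so enforcement, trait derivation and
        // (conditionally) the exploration of op3 are skipped.
        return c4 > c1;
    }

    public static void main(String[] args) {
        double scanCost = 10.0;    // winner cost of the fully optimized scan
        double op4SelfCost = 95.0; // self-cost of impl-op4
        double c1 = 100.0;         // cost of the optimized op1 <- op2 <- scan
        double c4 = cumulativeCost(op4SelfCost, scanCost); // 105.0
        System.out.println(canSkip(c4, c1)); // true
    }
}
```

The key difference from the SC3-based check: C4 is built only from fully optimized inputs, so it cannot be invalidated by later exploration.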
The new solution is a hybrid of top-down and bottom-up optimization.
Optimization requests with cost limits are passed down in a top-down manner,
while cost propagation and pruning take place in a bottom-up manner. It ensures
that when a logical node is explored, it has already been built (if it can be),
and all of its implementations' inputs are fully optimized. Besides better
space pruning, this design also brings other benefits:
1. When deriving traits, the knowledge about inputs is stable and concrete.
Trait derivation is executed only once for each node (except for some corner
cases), and the OMAKASE mode is fully supported.
2. No more cost-improvement propagation within a subset, and hence no more
RelMetadataQuery cache invalidations.
3. Exploration rules and enforcement rules can use RelMetadataQuery without
worrying about regressions: the input metadata is guaranteed to be complete
and stable.
Following this new approach, we implemented a new TopDownRuleDriver. Note that
the new version is very different from the previous one, so the Calcite
community may decide whether and how to merge it.
The condition "the substitutions always use op4 as an input" is easier to state
than to implement. For example, op3 can be transformed into op5 by an
exploration rule, and op5 <- op4 can then be transformed into op6 <- op7. In
this case, op6 <- op7 no longer uses op4 as an input, so the transformation
from op3 to op5 should still take place. How do we identify such cases?
An ideal solution would be to look into the rule set, build the network of
node transformations, determine that no rule accepts the pattern op5 <- op4,
and then decide that the transformation from op3 to op5 can be skipped.
However, Calcite's interfaces do not allow the rule driver to know what kinds
of nodes a transformation may produce without invoking the rule. That is, we
cannot know in advance that the exploration of op3 will produce op5 or any
other node. So we use a simpler but weaker solution: we only check whether
there are rules that match op4 as a non-root operand (a RelOptRuleOperand
whose ordinalInRule is not 0). This solution is not perfect, of course; there
is room for further improvement.
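A minimal sketch of that check, using a simplified stand-in for RelOptRule and RelOptRuleOperand. Only the ordinalInRule convention (0 means the rule's root operand) is taken from Calcite; the records, field names and rules below are hypothetical.

```java
import java.util.List;

// Hypothetical model of the "does any rule match op4 as a non-root
// operand?" check; these records are NOT the real Calcite classes.
public class NonRootMatchCheck {
    record Operand(Class<?> relClass, int ordinalInRule) {}
    record Rule(String name, List<Operand> operands) {}

    // op4's exploration results may feed another transformation only if
    // some rule has an operand matching op4's class with ordinalInRule != 0,
    // i.e. op4 can appear below a rule's root pattern.
    static boolean matchesAsNonRoot(Class<?> nodeClass, List<Rule> rules) {
        return rules.stream()
                .flatMap(r -> r.operands().stream())
                .anyMatch(op -> op.ordinalInRule() != 0
                        && op.relClass().isAssignableFrom(nodeClass));
    }

    interface Filter {}
    interface Project {}

    public static void main(String[] args) {
        // A rule whose root matches Filter (ordinal 0) over Project (ordinal 1).
        Rule filterOverProject = new Rule("FilterOverProject",
                List.of(new Operand(Filter.class, 0),
                        new Operand(Project.class, 1)));
        List<Rule> rules = List.of(filterOverProject);
        // Project can appear as a non-root input in some rule; Filter cannot.
        System.out.println(matchesAsNonRoot(Project.class, rules)); // true
        System.out.println(matchesAsNonRoot(Filter.class, rules));  // false
    }
}
```

If no rule matches op4's class as a non-root operand, the driver may assume op4 stays as an input of op5 and prune the exploration.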
One may doubt the correctness of the assumption that op5 always has op4 as its
input if no rule matches the pattern xx <- op4. Exceptions do exist when a rule
discards its inputs, such as FilterReduceExpressionsRule transforming a whole
subtree into an empty Values when it finds the condition is always false.
Luckily, I have only seen substitution rules behave this way, and substitution
rules are always applied.
Another consideration: although we cannot identify whether an exploration will
trigger subsequent explorations, experience tells us that this is quite rare,
especially when substitution rules have already been applied. So it is
sometimes worthwhile to simply give up such explorations for the sake of
performance. We therefore provide an option, aggressivePruning; when it is
turned on, pruning no longer considers the subsequent effects of an
exploration. Users may turn this feature on if they want better performance.
Note that this option is ON by default in the code, as we found little
regression during testing.
To make the pruning more efficient, we introduce another rule: when optimizing
a set, the subset with the default distribution and default collation is
optimized first. This rule introduces no regressions, considering that there is
always a converter from the default subset to a concrete subset, and the
default subset always has a cost no larger than the concrete subset's (because
all rels in the concrete subset also belong to the default subset). But it
brings many benefits:
- When computing the lower bound for an exploration rule, we do not know which
physical properties the future implementations will require, so the default
subset provides a lower-bound cost.
- Logical nodes can be ignored entirely while optimizing other subsets, as they
have already been optimized along with the default subset.
- We can still use the root cost of a subtree (the SC3 in the example) to prune
enforcement rules, because when applying enforcement rules, we know that all
inputs are fully optimized.
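The "default subset first" ordering could be sketched as a simple sort key. The trait spellings below ("ANY", "[]") are illustrative placeholders for Calcite's default RelDistribution and RelCollation, and the Subset record is a hypothetical stand-in for RelSubset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ordering subsets so that the one carrying the
// default distribution and default collation is optimized first.
public class DefaultSubsetFirst {
    record Subset(String distribution, String collation) {
        boolean isDefault() {
            // Placeholder spellings for the default traits.
            return distribution.equals("ANY") && collation.equals("[]");
        }
    }

    static List<Subset> optimizationOrder(List<Subset> subsets) {
        List<Subset> ordered = new ArrayList<>(subsets);
        // Default subset first: its winner is a valid lower bound for every
        // concrete subset, since a converter can always be appended to it.
        ordered.sort((a, b) -> Boolean.compare(b.isDefault(), a.isDefault()));
        return ordered;
    }

    public static void main(String[] args) {
        List<Subset> order = optimizationOrder(List.of(
                new Subset("HASH[0]", "[0 ASC]"),
                new Subset("ANY", "[]")));
        System.out.println(order.get(0).isDefault()); // true
    }
}
```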
Another interesting topic is introducing parallel optimization, as Orca does.
The new driver introduces a TaskScheduler that manages tasks. Ideally, it could
record the dependencies between tasks and schedule leaf tasks in parallel.
However, the current version does not implement such a parallel scheduler,
because most data structures of VolcanoPlanner are not thread-safe. Community
members who are interested in parallel optimization could help improve it.
[1] https://www.csd.uoc.gr/~hy460/pdf/CascadesFrameworkForQueryOptimization.pdf
[2] http://web.cecs.pdx.edu/~len/IDEAS01.pdf
--
This message was sent by Atlassian Jira
(v8.20.1#820001)