[ https://issues.apache.org/jira/browse/CALCITE-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinpeng Wu updated CALCITE-4920:
--------------------------------
    Description: 
Last year, we submitted a PR introducing the TopDownRuleDriver. The rule driver 
implements the top-down search strategy suggested by the Cascades framework[1] 
and provides a basic branch-and-bound pruning mechanism based on the upper-bound 
and lower-bound costs, as suggested by the Columbia paper[2].

However, the previous version of TopDownRuleDriver could only prune 
implementation rules and enforcement rules, not transformation rules. The 
reason mainly concerns logical properties.
In the classic Volcano/Cascades model, logical properties, such as output row 
count, are bound to an equivalent set and never change during optimization. 
The Columbia optimizer[2] depends heavily on this premise. However, Calcite 
does not obey this rule: in Calcite, the logical properties of a RelSubset are 
likely to change during optimization. Calcite is not the only optimizer engine 
that suffers from this. Orca's logical properties of an equivalent set also 
change, and it cannot perform logical pruning, either.

How does the logical properties problem prevent logical pruning? Take this plan 
as an example: sink <- op1 <- op2 <- scan.
By applying a transformation rule, op1 <- op2 is transformed to op3 <- op4. So 
we get a new alternative plan, say sink <- op3 <- op4 <- scan, in which op3 is 
in the same equivalent set as op1.
After implementations and enforcements, the sub-plan (op1 <- op2 <- scan) gets 
fully optimized and yields a winner with cost C1.
And now we are going to optimize op3. We know another plan in the same 
equivalent set has a cost of C1. So we can use C1 as a cost limit while 
optimizing op3. In the first step, we should build op3 into a physical plan, 
say impl-op3, and compute its self-cost as SC3.
Ideally, if SC3 is already greater than C1, we can conclude that op3 will 
never be part of the best plan, so the optimization of op4 can be skipped. 
That is the basic idea of group pruning in the Columbia optimizer[2].
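This group-pruning condition can be sketched as follows. This is a toy model with hypothetical names, using plain doubles where Calcite would use RelOptCost; it only illustrates the budget arithmetic, not the real planner API:

```java
// Toy model of Columbia-style group pruning. Given an upper bound C1 inherited
// from a sibling winner, a node whose self-cost SC already exceeds C1 cannot
// be part of the best plan, so its inputs need not be optimized at all;
// otherwise the inputs are optimized under the remaining budget C1 - SC.
public class GroupPruning {
    static final double INFEASIBLE = -1.0;

    /** Returns the cost budget left for the node's inputs,
     *  or INFEASIBLE when the node is pruned. */
    static double inputBudget(double selfCost, double upperBound) {
        double remaining = upperBound - selfCost;
        return remaining > 0 ? remaining : INFEASIBLE;
    }
}
```

In the running example, inputBudget(SC3, C1) returning INFEASIBLE is exactly the case where op4's optimization would be skipped.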
Here comes the problem: when we calculate the self-cost of impl-op3, we need 
its metadata, such as row count, which in turn asks impl-op3's input to derive 
its own metadata. However, the equivalent set of op4 is not yet fully explored, 
so its row count may not be final. The self-cost of impl-op3 may therefore be 
incorrect. If we apply group pruning based on such a cost, op4 will lose its 
opportunity to be explored, and with it the opportunity to become part of the 
best plan.

To ensure correctness, we require that all descendants be fully explored before 
calculating a node's cost. That is why our first version of TopDownRuleDriver 
only prunes implementation rules and enforcement rules.

In the past year, we tried several ways to solve the problem. For example, we 
tried to make Calcite's logical properties stable, as Xiening proposed. But 
the proposal was rejected, as changes of metadata after transformations are 
natural. We also tried to identify, by categories or annotations, rules that 
never change the logical properties, and to give up pruning for the other 
rules. But we failed again, because it introduced too much complexity for rule 
designers.

Those failures drove us to reconsider the problem from its very essence: if we 
cannot make SC3 stable, what if we give up using SC3 and leverage other costs 
for pruning?

Here is a brief description of the new idea. After obtaining C1, we eagerly 
build op3 and op4, without further exploring them. Because op4's input, the 
scan, was fully optimized during the optimization of op1, we can compute a 
stable cumulative cost of impl-op4. Let's denote it as C4. If C4 is already 
greater than C1, then we know impl-op4 will never be part of the best plan and 
some optimization steps can be skipped (for simplicity, let impl-op4 be the 
only input of impl-op3):
1. The enforcement rules among impl-sink, impl-op3 and impl-op4, as well as 
trait pass-through. These steps were not handled properly in the previous 
version.
2. The trait derivation of impl-op4 and impl-op3.
3. The exploration of op3, if the substitutions produced by exploration always 
use op4 as input. This is the key to logical pruning. I will explain it in 
more detail later on.
Note that the exploration of op4 is not pruned, as we don't know whether op4's 
other alternatives would yield a lower cost. Moreover, the implementation of 
op3 is not skipped, as it has already been applied. But the implementation of 
other alternatives of op3 can be skipped if the exploration is pruned.
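The bottom-up check above can be sketched as follows, again as a toy model with hypothetical names and plain doubles rather than the real planner types. The key point is that C4 is built only from fully optimized input winners, so it is stable, unlike SC3:

```java
// Toy model of the bottom-up pruning described above (hypothetical names).
public class BottomUpPruning {
    /** Stable cumulative cost of an eagerly built physical node:
     *  its own self-cost plus the winner costs of its inputs, which are
     *  all fully optimized and hence final. */
    static double cumulativeCost(double selfCost, double... inputWinnerCosts) {
        double total = selfCost;
        for (double c : inputWinnerCosts) {
            total += c;
        }
        return total;
    }

    /** True when later steps (enforcement, trait derivation, and, under the
     *  conditions described in the text, exploration of the parent group)
     *  can be skipped. */
    static boolean canSkip(double cumulativeCost, double upperBound) {
        return cumulativeCost > upperBound;
    }
}
```

With C4 = cumulativeCost(impl-op4's self-cost, scan winner cost), canSkip(C4, C1) is the safe pruning decision, because no unexplored metadata can change C4.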

The new solution is a hybrid of top-down and bottom-up optimization. 
Optimization requests with cost limits are passed down in a top-down manner, 
while cost propagation and pruning take place in a bottom-up manner. It 
ensures that when a logical node is explored, it has already been built (when 
possible) and all of its implementations' inputs are fully optimized. Besides 
better search-space pruning, this design brings some other benefits:
1. When deriving traits, knowledge about inputs is stable and concrete. Trait 
derivation is executed only once for each node (except in some corner cases). 
And the OMAKASE mode is fully supported.
2. No more cost-improvement propagation within subsets, and hence no more 
RelMetadataQuery cache invalidations.
3. Exploration rules and enforcement rules can use the metadata query without 
worrying about regressions, since the input metadata is guaranteed to be 
complete and stable.

Following this new idea, we implemented a new TopDownRuleDriver. Note that the 
new version is very different from the previous one, so the Calcite community 
may decide whether, and how, to merge it.

The condition "the substitutions always use op4 as input" is easier to state 
than to implement. For example, op3 can transform to op5 via an exploration 
rule, and op5 <- op4 can then transform to op6 <- op7. In this case, op6 <- op7 
no longer uses op4 as input, so the transformation from op3 to op5 should take 
place. How can we identify such cases?
An ideal solution might be to inspect the rule set, build the network of node 
transformations, determine that no rule accepts the pattern op5 <- op4, and 
then decide that the transformation from op3 to op5 can be skipped.
However, Calcite's interfaces do not allow the rule driver to know what kinds 
of nodes a transformation may produce without invoking the rule. That is, we 
cannot know in advance that the exploration of op3 will produce op5 or any 
other node. So we use a simpler but weaker solution: we only check whether 
there are rules that match op4 as a non-root operand (RelOptRuleOperand's 
ordinalInRule is not 0). This solution is not perfect, of course; there is 
room for further improvement.
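The check can be sketched roughly as below. Operand and Rule here are hypothetical stand-ins for Calcite's RelOptRuleOperand and RelOptRule, but the ordinal convention mirrors the real field: the root operand of a rule has ordinalInRule 0:

```java
import java.util.List;

// Simplified model of the non-root-operand check (hypothetical Operand/Rule
// types; the real driver would inspect RelOptRuleOperand instances).
public class NonRootMatchCheck {
    record Operand(Class<?> matchedClass, int ordinalInRule) {}
    record Rule(List<Operand> operands) {}

    /** True if some rule could match nodes of opClass below its root operand,
     *  in which case exploring the parent group must not be pruned, because
     *  a later transformation might stop using this node as an input. */
    static boolean matchedAsNonRoot(List<Rule> rules, Class<?> opClass) {
        for (Rule rule : rules) {
            for (Operand op : rule.operands()) {
                if (op.ordinalInRule() != 0
                        && op.matchedClass().isAssignableFrom(opClass)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In the example, if matchedAsNonRoot(ruleSet, op4's class) is false, no rule consumes op4 from below, so the exploration of op3 can be skipped once C4 exceeds C1.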
One may doubt the correctness of the assumption that op5 always has op4 as its 
input when no rules match the pattern xx <- op4. Exceptions do exist when a 
rule discards its inputs; for example, FilterReduceExpressionsRule transforms 
the whole subtree into an empty Values when it finds that the condition is 
always false. Luckily, I have only seen substitution rules behave this way, 
and substitution rules are always applied.
Another consideration: although we cannot tell whether an exploration will 
trigger subsequent explorations, experience tells us that this is quite rare, 
especially when substitution rules have already been applied. So it is 
sometimes worthwhile to simply give up such explorations for the sake of 
performance. We therefore provide an option, aggressivePruning; when it is 
turned on, pruning no longer considers the subsequent effects of an 
exploration. Users may turn this feature on if they want better performance. 
Note that this option defaults to ON in the code, as we found little 
regression during testing.

To make the pruning more efficient, we introduce another rule: when optimizing 
a subset, the subset with the default distribution and default collation is 
optimized first. This rule introduces no regression, because there should 
always be a converter from the default subset to a concrete subset, and the 
default subset always has a cost no larger than the concrete subset (all rels 
in a concrete subset also belong to the default subset). But it brings many 
benefits:
 - When computing the lower bound for an exploration rule, we do not know 
which physical properties the future implementations will require, so the 
default subset provides a lower-bound cost.
 - Logical nodes can be completely ignored while optimizing other subsets, as 
they must already have been optimized when the default subset was optimized.
 - We can still use the root cost of a subtree (SC3 in the example) to prune 
enforcement rules, because when applying enforcement rules we know that all 
inputs are fully optimized.
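The ordering rule can be sketched as a comparator that places the default subset first. Subset here is a hypothetical toy type (Calcite's real subsets are RelSubset instances keyed by trait sets), and the "ANY"/"NONE" trait labels are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model: the subset with default distribution and collation is optimized
// first, since every concrete subset is reachable from it via a converter and
// its cost lower-bounds the concrete subsets' costs.
public class SubsetOrdering {
    record Subset(String distribution, String collation) {
        boolean isDefault() {
            return distribution.equals("ANY") && collation.equals("NONE");
        }
    }

    /** Returns the subsets in optimization order: default subset first,
     *  concrete subsets afterwards (stable for ties). */
    static List<Subset> optimizationOrder(List<Subset> subsets) {
        List<Subset> ordered = new ArrayList<>(subsets);
        ordered.sort(Comparator.comparingInt(s -> s.isDefault() ? 0 : 1));
        return ordered;
    }
}
```

This is only the scheduling idea; the real driver interleaves this ordering with task scheduling rather than sorting a list up front.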

Another interesting topic is introducing parallel optimization, as Orca does. 
The new driver introduces a TaskScheduler that manages tasks. Ideally, it 
could record the dependencies between tasks and schedule leaf tasks in 
parallel. However, the current version does not implement such a parallel 
scheduler, because most of VolcanoPlanner's data structures are not thread 
safe. Community members who are interested in parallel optimization are 
welcome to help improve it.

[1] https://www.csd.uoc.gr/~hy460/pdf/CascadesFrameworkForQueryOptimization.pdf
[2] http://web.cecs.pdx.edu/~len/IDEAS01.pdf



> Introduce logical space pruning to TopDownRuleDriver
> ----------------------------------------------------
>
>                 Key: CALCITE-4920
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4920
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Jinpeng Wu
>            Assignee: Jinpeng Wu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
