Hi colleagues,

I would like to discuss with the community the possibility of adding a new
public method to VolcanoPlanner which would forcefully re-trigger the rules
for a specific rel. This is a follow-up to a discussion started in another
thread [1].

**Problem statement**
When converting between conventions during optimization, VolcanoPlanner
prefers a top-down approach, so nodes are converted starting from the
root. But in some cases an intermediate node must be converted after its
children. This is especially true for distributed SQL engines, which rely
on distribution traits during the optimization process: it is not possible
to efficiently choose a proper physical implementation of a parent node
unless the physical representation of its child nodes is known.

It seems that, at present, VolcanoPlanner cannot address such cases by
default. Suppose we have two nodes and associated rules that convert them
to their physical counterparts:
[LogicalParent <- LogicalChild]
The parent should be converted after the child. When the child is
converted, a physical node is created:
[LogicalParent <- {LogicalChild, PhysicalChild}]
To finish the optimization process we need to convert the parent, but the
parent's rules are not fired, because PhysicalChild has traits
incompatible with LogicalParent.
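
To make this concrete, here is a minimal sketch of what such a parent rule
could look like (LogicalParent, LogicalChild and PhysicalParent are the
hypothetical classes from the example above). Because the operand tree
matches only logical children, registering PhysicalChild in the same
RelSet never re-triggers the rule:

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;

class PhysicalParentRule extends RelOptRule {
    PhysicalParentRule() {
        // Matches a logical parent over a logical child; a
        // PhysicalChild added to the same RelSet does not match the
        // child operand, so the rule is not fired again.
        super(operand(LogicalParent.class,
            operand(LogicalChild.class, any())));
    }

    @Override public void onMatch(RelOptRuleCall call) {
        // Would create PhysicalParent here, based on the traits of
        // the (still logical) child.
    }
}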

Presently the problem could be solved in two ways:
1) Always produce conversions when going top-down. In this case, we first
create a physical parent, then a physical child. The problem is that the
created parent is not optimal, because the child's distribution was not
known at the time of its creation. So the full flow is: create a bad
parent, create a good child, create a good parent.
1.1) [LogicalParent <- LogicalChild]
1.2) [{LogicalParent, PhysicalParentBad} <- LogicalChild]
1.3) [{LogicalParent, PhysicalParentBad} <- {LogicalChild, PhysicalChild}]
1.4) [{LogicalParent, PhysicalParentBad, PhysicalParentGood} <-
{LogicalChild, PhysicalChild}]
What is worse, the creation of a non-optimal parent will trigger rules on
its own parent, which in turn may create a non-optimal grandparent node,
and so on.
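
As a sketch, workaround 1 roughly corresponds to an eager converter rule
like the one below (names are hypothetical; PHYSICAL stands for an assumed
custom convention). Because it fires before the child is converted, it has
to guess the distribution:

import org.apache.calcite.plan.Convention;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.convert.ConverterRule;

class EagerPhysicalParentRule extends ConverterRule {
    EagerPhysicalParentRule() {
        super(LogicalParent.class, Convention.NONE, PHYSICAL,
            "EagerPhysicalParentRule");
    }

    @Override public RelNode convert(RelNode rel) {
        LogicalParent logical = (LogicalParent) rel;
        // The child is still logical at this point, so its
        // distribution is unknown; the result is the "bad" parent
        // from step 1.2 above.
        return new PhysicalParent(...);
    }
}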

2) Make sure that your convention returns true from
Convention.canConvertConvention. In this case, every time a new rel is
added to a RelSet, its traitset is compared to the traitsets of all other
rels in the set. For every pair of traitsets, the engine may be asked to
create a relevant AbstractConverter. The net effect is that a
"physical-to-logical" converter is created, which re-triggers the rules on
the logical parent, since their conventions are compatible (see the sketch
after the diagram below):
2.1) [LogicalParent  <- LogicalChild]
2.2) [LogicalParent <- {LogicalChild, PhysicalChild}]
2.3) [LogicalParent <- {LogicalChild, PhysicalChild,
PhysicalToLogicalConverter}]
2.4) [{LogicalParent, PhysicalParent} <- {LogicalChild, PhysicalChild,
PhysicalToLogicalConverter}]

The problem with this approach is that it is too coarse-grained, since we
operate on traitsets rather than on individual rels. As a result, it
introduces additional memory and CPU pressure: far too many converters are
usually created, and they trigger the rules over and over again.

**Affected products**
At the moment, two distributed engines are being developed, for Hazelcast
and for Apache Ignite. Both require bottom-up optimization and currently
rely on the second workaround.
Another example is Apache Drill. I do not know whether its community is
concerned with this issue, but Drill also uses bottom-up optimization for
many rules and employs both of the aforementioned workarounds. As a
result, I would guess that Drill's optimizer also creates too many rels
during optimization and suffers from a huge search space. Please correct
me if I am wrong.

**Proposal**
The key problem is that there is no way to re-fire rules on a specific
node. The original problem could be solved if it were possible to re-fire
the rules on LogicalParent without creating any additional rels. That
would lead to a clean optimization flow:
2.1) [LogicalParent  <- LogicalChild]
2.2) [LogicalParent <- {LogicalChild, PhysicalChild}]
2.3) [{LogicalParent, PhysicalParent} <- {LogicalChild, PhysicalChild}]

We could add the following method to the VolcanoPlanner (or perhaps
RelOptPlanner?) interface:
void fireRules(RelNode rel)

This method would fire the rules on the passed node in deferred mode, as
if it were a new node just added to the optimizer. This would require
slight changes to the RuleQueue.addMatch method, and possibly to some
other places.

Usage example:
class PhysicalChildRule extends RelOptRule {
    PhysicalChildRule() {
        super(operand(LogicalChild.class, any()));
    }

    @Override public void onMatch(RelOptRuleCall call) {
        LogicalChild logicalRel = call.rel(0);
        PhysicalChild physicalRel = ...; // create the physical counterpart

        call.transformTo(physicalRel);

        // Proposed method: re-fire the rules on the logical node, so
        // that its parents can now be converted.
        call.getPlanner().fireRules(logicalRel);
    }
}
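
For completeness, a very rough sketch of the planner-side change, assuming
the existing internal fireRules(RelNode, boolean) machinery that queues
rule matches when a rel is registered stays roughly as it is today; this
is just to make the proposal concrete, not a final design:

// Inside VolcanoPlanner (sketch):
public void fireRules(RelNode rel) {
    // Queue matches in deferred mode, exactly as for a newly
    // registered rel; RuleQueue.addMatch would need to accept
    // re-queued matches for an already-registered rel.
    fireRules(rel, true);
}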

What does the community think about such a method? Does it make sense to
you? If not, are you aware of any other way to organize bottom-up
optimization with VolcanoPlanner without creating additional rels?

If the community is OK with the idea in general, I can try to create a PR
with a prototype.

Would appreciate your feedback.

Regards,
Vladimir.

[1]
https://ponymail-vm.apache.org/_GUI_/thread.html/13e7ab2040bfa4902db6647992ec4203ceb0262cfcb28d38ef7e9e32@%3Cdev.calcite.apache.org%3E
