Hi Alessandro,
this problem was already discussed on dev-list [1] and we have a ticket
for this [2].
My concern is that many projects use Calcite as a Lego kit: they took
internal components of Calcite and combine them for building a custom
planning and execution pipeline. And sometimes downstream projects need
to change the default behavior of internal components to fit their
requirements or overcome the bug. So the idea of keeping even internal
components of Calcite "more public" is rather a good thing than the bad
one from my point of view.
Thank you.
[1] https://lists.apache.org/thread/cykl74dcphgow4790fwoc8frsjglz7n1
[2] https://issues.apache.org/jira/browse/CALCITE-4542
--
Roman Kondakov
On 11.02.2022 19:15, Alessandro Solimando wrote:
Hello everyone,
@Vladimir, +1 on the change introducing "enforceDerive()".
@Roman, could you walk us through the limitations you found that forced you
to copy-paste the whole class?
Maybe there is some middle ground for your problem(s) too, similar in
spirit to what Vladimir proposed for the other limitation.
I am not against making the class more public if necessary, but it would be
nice to have a discussion here before going down that path.
If the discussion leads to a better design of the original class, all
projects would benefit from that.
Best regards,
Alessandro
On Fri, 11 Feb 2022 at 04:14, Roman Kondakov <[email protected]>
wrote:
Hi Vladimir,
+1 for making the rule driver more public. We've faced similar problems
in the downstream project. The solution was to copy and paste the
TopDownRuleDrive code with small fixes since it was not possible to
override the default behavior.
--
Roman Kondakov
On 11.02.2022 02:50, Vladimir Ozerov wrote:
Hi,
In the Cascades driver, it is possible to propagate the requests top-down
using the "passThrough", method and then notify parents bottom-up about
the
concrete physical implementations of inputs using the "derive" method.
In some optimizers, the valid parent node cannot be created before the
trait sets of inputs are known. An example is a custom distribution trait
that includes the number of shards in the system. The parent operator
alone
may guess the distribution keys, but cannot know the number of input
shards. To mitigate this, you may create a "template" node with an
infinite
cost from within the optimization rule that will propagate the
passThrough/drive calls but would never participate in the final plan.
Currency, the top-down driver designed in a way that the nodes created
from
the "passThrough" method are not notified on the "derive" stage. This
leads
to the incomplete exploration of the search space. For example, the rule
may produce the node "A1.template" that will be converted into a normal
"A1" node in the derive phase. However, if the parent operator produced
"A2.template" from "A1.template" using pass-through mechanics, the
"A2.template" will never be notified about the concrete input traits,
possibly losing the optimal plan. This is especially painful in
distributed
engines, where the number of shards is important for the placement of
Shuffle operators.
It seems that the problem could be solved with relatively low effort. The
"derive" is not invoked on the nodes created from the "passThrough"
method,
because such nodes are placed in the "passThroughCache" collection.
Instead
of doing this unconditionally, we may introduce an additional predicate
that would selectively enforce "derive" on such nodes. For example, this
could be a default method in the PhysicalNode interface, like:
interface PhysicalNode {
default boolean enforceDerive() { return false; }
}
If there are no objections, I'll proceed with this change.
Alternatively, we may make the TopDownRuleDriver more "public", so that
the
user can extend it and decide within the driver whether to cache a
particular node or not.
I would appreciate your feedback on the matter.
Regards,
Vladimir.