Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Julian Hyde Thu, 03 Dec 2020 12:21:17 -0800

A long time ago, we stored the list of RelNodes in each RelSubset, and
changed to a shared list in RelSet. I still think it makes sense,
especially when subsets are not disjoint (e.g. if a set has one subset
sorted by (x) and another subset sorted by (x, y) then all of the
RelNodes in the latter will also be in the former subset).


A simple optimization to getRelList makes testQuery07 about 20% faster
(39s to 33s for me):

diff --git a/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
b/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
index b1e0b6322..7f7e8cbc9 100644
--- a/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
+++ b/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
@@ -399,7 +399,7 @@ RelNode buildCheapestPlan(VolcanoPlanner planner) {
   public List<RelNode> getRelList() {
     final List<RelNode> list = new ArrayList<>();
     for (RelNode rel : set.rels) {
-      if (rel.getTraitSet().satisfies(traitSet)) {
+      if (rel.getTraitSet() == traitSet ||
rel.getTraitSet().satisfies(traitSet)) {
         list.add(rel);
       }
     }

It's worth pushing down that logic into RelTraitSet.satisfies, so
other code will benefit. And maybe some further improvements to
RelTraitSet.satisfies and RelTrait.satisfies.

Julian

On Thu, Dec 3, 2020 at 6:03 AM JiaTao Tao <[email protected]> wrote:
>
> Hi Chunwei
> Sure, I'll organize a document to list the root clauses about the
> optimization StackOverflow and fire JIRAS, thank for your help.
>
> Regards!
>
> Aron Tao
>
>
> Chunwei Lei <[email protected]> 于2020年12月3日周四 下午4:01写道：
>
> > Thank you for your effort, Aron.
> >
> > I am wondering if you can file a jira issue and paste your findings.
> > I would like to look into it if I find some time.
> >
> > Best,
> > Chunwei
> >
> >
> > On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <[email protected]> wrote:
> >
> > > Hi Vladimir
> > > Thanks for your reply:
> > > 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
> > > re-run the PR ut, the ut is passed, so this ut timeout may be an
> > occasional
> > > situation.
> > >
> > > 2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
> > > about 23%, sorry for that, we test the feature in our product env(Calcite
> > > is our SQL facade in ByteDance, it process hundreds of thousands SQL per
> > > day), and we have monitor about optimization time, no obvious changes
> > were
> > > found. Without pull/2222, some cases will fail due to "There are not
> > enough
> > > rules to produce a node with desired properties: There is 1 empty
> > subset",
> > > the root cause is the outdated rule matches have been fired.
> > >
> > > Actually, I found actually the time was taken by RelSubset#getRelList, a
> > > fresh case I found yesterday, see
> > > https://github.com/apache/calcite/pull/2283, it OOM
> > > in VolcanoRuleCall.matchRecurse when calling getRelList(but the root
> > clause
> > > must be the deadlock between rules), so I think the pull/2222's fix is
> > not
> > > a problem, the true problem is in "getRelList", my proposal is we can
> > store
> > > the relList in RelSubSet like RelSet#rels.
> > >
> > > 3. For the case TpchTest#testQuery07, after optimization, the largest rel
> > > ID is increased to *900k*, and has *20k* rels remain in volcano planner,
> > > so it's pretty large. In fact, after profile, I found most of the time is
> > > taken by ProjectMergeRule, in our company's error cases(OOM, Stack
> > > Overflow) are also related to this rule, maybe we can see if we can
> > > optimize this.
> > >
> > > Finally, to summarize:
> > > 1. TpchTest#testQuery07 timeout is an occasional situation, normally it
> > > ends about 40~45s(in my MBP15 2019).
> > > 2. pull/2222 itself is a positive fix, it due increase the time but the
> > > true problem is RelSubset#getRelList, we can improve pull/2222 by
> > > improving RelSubset#getRelList, this not only benefits this, it
> > > makes getRelList a cheap function as its name, we can call this with
> > > no burden
> > > 3. Calcite sometime will generate huge alternate rels, it actually not
> > > reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
> > > but the SQL is highly private, If you are interested, I can share the
> > > cases' root clause using blog or email, but I may have no time to
> > construct
> > > these cases. Let's workout together to decrease them.
> > >
> > > Regards!
> > >
> > > Aron Tao
> > >
> > >
> > > Vladimir Sitnikov <[email protected]> 于2020年12月2日周三 下午10:01写道：
> > >
> > > > Hi,
> > > >
> > > > I've made a quick check, and the results are as follows:
> > > > 1) The test was enabled recently (2020-10-20) in
> > > >
> > > >
> > >
> > https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> > > > 2) One of the recent changes that do significantly impact the
> > performance
> > > > of the test is https://github.com/apache/calcite/pull/2222
> > > > 3) It is sad we commit performance-related fixes without taking
> > > performance
> > > > into consideration :-(
> > > >
> > > > Do you think you could look into 2222 again and check if the slowdown
> > was
> > > > expected?
> > > > Do you think you could improve 2222?
> > > > Do you think we should revert 2222?
> > > >
> > > > Vladimir
> > > >
> > >
> >

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Reply via email to