Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

JiaTao Tao Thu, 03 Dec 2020 06:03:50 -0800

Hi Chunwei
Sure, I'll organize a document to list the root clauses about the
optimization StackOverflow and fire JIRAS, thank for your help.


Regards!

Aron Tao


Chunwei Lei <[email protected]> 于2020年12月3日周四 下午4:01写道：

> Thank you for your effort, Aron.
>
> I am wondering if you can file a jira issue and paste your findings.
> I would like to look into it if I find some time.
>
> Best,
> Chunwei
>
>
> On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <[email protected]> wrote:
>
> > Hi Vladimir
> > Thanks for your reply:
> > 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after
> > re-run the PR ut, the ut is passed, so this ut timeout may be an
> occasional
> > situation.
> >
> > 2. For the pull/2222, it due increase the time of TpchTest#testQuery07,
> > about 23%, sorry for that, we test the feature in our product env(Calcite
> > is our SQL facade in ByteDance, it process hundreds of thousands SQL per
> > day), and we have monitor about optimization time, no obvious changes
> were
> > found. Without pull/2222, some cases will fail due to "There are not
> enough
> > rules to produce a node with desired properties: There is 1 empty
> subset",
> > the root cause is the outdated rule matches have been fired.
> >
> > Actually, I found actually the time was taken by RelSubset#getRelList, a
> > fresh case I found yesterday, see
> > https://github.com/apache/calcite/pull/2283, it OOM
> > in VolcanoRuleCall.matchRecurse when calling getRelList(but the root
> clause
> > must be the deadlock between rules), so I think the pull/2222's fix is
> not
> > a problem, the true problem is in "getRelList", my proposal is we can
> store
> > the relList in RelSubSet like RelSet#rels.
> >
> > 3. For the case TpchTest#testQuery07, after optimization, the largest rel
> > ID is increased to *900k*, and has *20k* rels remain in volcano planner,
> > so it's pretty large. In fact, after profile, I found most of the time is
> > taken by ProjectMergeRule, in our company's error cases(OOM, Stack
> > Overflow) are also related to this rule, maybe we can see if we can
> > optimize this.
> >
> > Finally, to summarize:
> > 1. TpchTest#testQuery07 timeout is an occasional situation, normally it
> > ends about 40~45s(in my MBP15 2019).
> > 2. pull/2222 itself is a positive fix, it due increase the time but the
> > true problem is RelSubset#getRelList, we can improve pull/2222 by
> > improving RelSubset#getRelList, this not only benefits this, it
> > makes getRelList a cheap function as its name, we can call this with
> > no burden
> > 3. Calcite sometime will generate huge alternate rels, it actually not
> > reasonable most of the time, we've found many OOM/Stack Overflow/timeout,
> > but the SQL is highly private, If you are interested, I can share the
> > cases' root clause using blog or email, but I may have no time to
> construct
> > these cases. Let's workout together to decrease them.
> >
> > Regards!
> >
> > Aron Tao
> >
> >
> > Vladimir Sitnikov <[email protected]> 于2020年12月2日周三 下午10:01写道：
> >
> > > Hi,
> > >
> > > I've made a quick check, and the results are as follows:
> > > 1) The test was enabled recently (2020-10-20) in
> > >
> > >
> >
> https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8
> > > 2) One of the recent changes that do significantly impact the
> performance
> > > of the test is https://github.com/apache/calcite/pull/2222
> > > 3) It is sad we commit performance-related fixes without taking
> > performance
> > > into consideration :-(
> > >
> > > Do you think you could look into 2222 again and check if the slowdown
> was
> > > expected?
> > > Do you think you could improve 2222?
> > > Do you think we should revert 2222?
> > >
> > > Vladimir
> > >
> >
>

Re: TpchTest#testQuery07 timeout when running continuous-integration/appveyor/pr

Reply via email to