Hi Chunwei Sure, I'll organize a document to list the root clauses about the optimization StackOverflow and fire JIRAS, thank for your help.
Regards! Aron Tao Chunwei Lei <[email protected]> 于2020年12月3日周四 下午4:01写道: > Thank you for your effort, Aron. > > I am wondering if you can file a jira issue and paste your findings. > I would like to look into it if I find some time. > > Best, > Chunwei > > > On Thu, Dec 3, 2020 at 1:15 PM JiaTao Tao <[email protected]> wrote: > > > Hi Vladimir > > Thanks for your reply: > > 1. I run TpchTest#testQuery07 in my MPB, the avg time is 43s, and after > > re-run the PR ut, the ut is passed, so this ut timeout may be an > occasional > > situation. > > > > 2. For the pull/2222, it due increase the time of TpchTest#testQuery07, > > about 23%, sorry for that, we test the feature in our product env(Calcite > > is our SQL facade in ByteDance, it process hundreds of thousands SQL per > > day), and we have monitor about optimization time, no obvious changes > were > > found. Without pull/2222, some cases will fail due to "There are not > enough > > rules to produce a node with desired properties: There is 1 empty > subset", > > the root cause is the outdated rule matches have been fired. > > > > Actually, I found actually the time was taken by RelSubset#getRelList, a > > fresh case I found yesterday, see > > https://github.com/apache/calcite/pull/2283, it OOM > > in VolcanoRuleCall.matchRecurse when calling getRelList(but the root > clause > > must be the deadlock between rules), so I think the pull/2222's fix is > not > > a problem, the true problem is in "getRelList", my proposal is we can > store > > the relList in RelSubSet like RelSet#rels. > > > > 3. For the case TpchTest#testQuery07, after optimization, the largest rel > > ID is increased to *900k*, and has *20k* rels remain in volcano planner, > > so it's pretty large. In fact, after profile, I found most of the time is > > taken by ProjectMergeRule, in our company's error cases(OOM, Stack > > Overflow) are also related to this rule, maybe we can see if we can > > optimize this. > > > > Finally, to summarize: > > 1. TpchTest#testQuery07 timeout is an occasional situation, normally it > > ends about 40~45s(in my MBP15 2019). > > 2. pull/2222 itself is a positive fix, it due increase the time but the > > true problem is RelSubset#getRelList, we can improve pull/2222 by > > improving RelSubset#getRelList, this not only benefits this, it > > makes getRelList a cheap function as its name, we can call this with > > no burden > > 3. Calcite sometime will generate huge alternate rels, it actually not > > reasonable most of the time, we've found many OOM/Stack Overflow/timeout, > > but the SQL is highly private, If you are interested, I can share the > > cases' root clause using blog or email, but I may have no time to > construct > > these cases. Let's workout together to decrease them. > > > > Regards! > > > > Aron Tao > > > > > > Vladimir Sitnikov <[email protected]> 于2020年12月2日周三 下午10:01写道: > > > > > Hi, > > > > > > I've made a quick check, and the results are as follows: > > > 1) The test was enabled recently (2020-10-20) in > > > > > > > > > https://github.com/apache/calcite/commit/b5a761e559ca1c1c914e388df4c6a0958dc17fc8 > > > 2) One of the recent changes that do significantly impact the > performance > > > of the test is https://github.com/apache/calcite/pull/2222 > > > 3) It is sad we commit performance-related fixes without taking > > performance > > > into consideration :-( > > > > > > Do you think you could look into 2222 again and check if the slowdown > was > > > expected? > > > Do you think you could improve 2222? > > > Do you think we should revert 2222? > > > > > > Vladimir > > > > > >
