[
https://issues.apache.org/jira/browse/KYLIN-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865934#comment-17865934
]
pengfei.zhan edited comment on KYLIN-5876 at 7/15/24 8:06 AM:
--------------------------------------------------------------
h1. Root Cause
The Calcite query plan after SQL is converted to RelRoot is as follows,
focusing on the CAST($0):DOUBLE part, where the stack overflow problem occurs
in the CBO phase
{code:java}
LogicalSort(fetch=[500])
LogicalProject(TEMP(Test)(3502400386)(0)=[CASE(=($1, 0), null,
/(CAST($0):DOUBLE, $1))])
LogicalJoin(condition=[true], joinType=[inner])
LogicalProject(__measure__0=[$0])
LogicalFilter(condition=[>($1, 0)])
LogicalAggregate(group=[{}], __measure__0=[SUM($0)], agg#1=[COUNT()])
LogicalProject(NUM1=[$2], $f1=[1])
OlapTableScan(table=[[TDVT, CALCS]], ctx=[], fields=[[0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26]])
LogicalProject(__measure__2=[$0])
LogicalFilter(condition=[>($1, 0)])
LogicalAggregate(group=[{}], __measure__2=[AVG($0)], agg#1=[COUNT()])
LogicalProject(__measure__0=[$0], $f1=[1])
LogicalProject(__measure__0=[$1])
LogicalAggregate(group=[{0}], __measure__0=[SUM($1)])
LogicalProject(STR1=[$7], NUM1=[$2])
OlapTableScan(table=[[TDVT, CALCS]], ctx=[], fields=[[0, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26]])
{code}
h1. Dev Design
CBO's optimization process is unstable, repeatedly matching the optimization
rules in order to select the least costly implementation, the problem here is
related to the execution order of the optimization rules (KE code has done a
version of renaming the optimization rules to move the classes, etc.), when the
OlapProjectMergeRule is matched after the When OlapProjectMergeRule matches
after ProjectReduceExpressionsRule, and the optimizer can't be sure that the
converted optimization rule must be better than the pre-conversion logic plan,
it keeps using CAST(xxx):DOUBLE to convert the previous ProjectRel, and
CAST($0):DOUBLE is indefinitely nested into something like
CAST(CAST(CAST($0):DOUBLE):DOUBLE):DOUBLE, which is the direct cause of the
stack overflow.
I briefly tried both fixes to get past it, and the exact implementation needs
to be discussed as to which one is better before fixing it:
In the implementation of kylin, the OlapProjectMergeRule conversion logic plan
after setting the weights, always think that the optimization rules after the
conversion is better than before the conversion (simple)
Reference DRILL: https://issues.apache.org/jira/browse/DRILL-6212 fix, took
simplifyCast's logic and tried it ( I'm guessing that it skips too many nested
calls and thus consumes less cost, so there is no infinite nesting. The
calculation of cost is multifaceted, but at the moment the KE side only has
simple information such as IO, memory, etc. as calculations.), but it needs to
be adjusted to fit the KE to be compatible with the previous implementation
(complex).
was (Author: JIRAUSER294653):
h1. Root Cause
The Calcite query plan after SQL is converted to RelRoot is as follows,
focusing on the CAST($0):DOUBLE part, where the stack overflow problem occurs
in the CBO phase
{code:java}
LogicalSort(fetch=[500])
LogicalProject(TEMP(Test)(3502400386)(0)=[CASE(=($1, 0), null,
/(CAST($0):DOUBLE, $1))])
LogicalJoin(condition=[true], joinType=[inner])
LogicalProject(__measure__0=[$0])
LogicalFilter(condition=[>($1, 0)])
LogicalAggregate(group=[{}], __measure__0=[SUM($0)], agg#1=[COUNT()])
LogicalProject(NUM1=[$2], $f1=[1])
OlapTableScan(table=[[TDVT, CALCS]], ctx=[], fields=[[0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26]])
LogicalProject(__measure__2=[$0])
LogicalFilter(condition=[>($1, 0)])
LogicalAggregate(group=[{}], __measure__2=[AVG($0)], agg#1=[COUNT()])
LogicalProject(__measure__0=[$0], $f1=[1])
LogicalProject(__measure__0=[$1])
LogicalAggregate(group=[{0}], __measure__0=[SUM($1)])
LogicalProject(STR1=[$7], NUM1=[$2])
OlapTableScan(table=[[TDVT, CALCS]], ctx=[], fields=[[0, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26]])
{code}
CBO's optimization process is unstable, repeatedly matching the optimization
rules in order to select the least costly implementation, the problem here is
related to the execution order of the optimization rules (KE code has done a
version of renaming the optimization rules to move the classes, etc.), when the
OlapProjectMergeRule is matched after the When OlapProjectMergeRule matches
after ProjectReduceExpressionsRule, and the optimizer can't be sure that the
converted optimization rule must be better than the pre-conversion logic plan,
it keeps using CAST(xxx):DOUBLE to convert the previous ProjectRel, and
CAST($0):DOUBLE is indefinitely nested into something like
CAST(CAST(CAST($0):DOUBLE):DOUBLE):DOUBLE, which is the direct cause of the
stack overflow.
I briefly tried both fixes to get past it, and the exact implementation needs
to be discussed as to which one is better before fixing it:
In the implementation of kylin, the OlapProjectMergeRule conversion logic plan
after setting the weights, always think that the optimization rules after the
conversion is better than before the conversion (simple)
Reference DRILL: https://issues.apache.org/jira/browse/DRILL-6212 fix, took
simplifyCast's logic and tried it ( I'm guessing that it skips too many nested
calls and thus consumes less cost, so there is no infinite nesting. The
calculation of cost is multifaceted, but at the moment the KE side only has
simple information such as IO, memory, etc. as calculations.), but it needs to
be adjusted to fit the KE to be compatible with the previous implementation
(complex).
> Avoid infinite match in OlapProjectMergeRule caused by simplify projects
> ------------------------------------------------------------------------
>
> Key: KYLIN-5876
> URL: https://issues.apache.org/jira/browse/KYLIN-5876
> Project: Kylin
> Issue Type: Improvement
> Components: Query Engine
> Affects Versions: 5.0-alpha
> Reporter: pengfei.zhan
> Assignee: pengfei.zhan
> Priority: Major
> Fix For: 5.0.0
>
>
> tdvt test compared to version kylin-5.0-alpha a new failed sql has been
> added, the query reports an error
> {code:sql}
> SELECT (CASE WHEN "t2"."__measure__2" = 0 THEN NULL ELSE
> CAST("t0"."__measure__0" AS DOUBLE PRECISION) / "t2"."__measure__2" END) AS
> "TEMP(Test)(3502400386)(0)"
> FROM (
> SELECT SUM("CALCS"."NUM1") AS "__measure__0"
> FROM "TDVT"."CALCS" "CALCS"
> HAVING (COUNT(1) > 0)
> ) "t0"
> CROSS JOIN (
> SELECT AVG("t1"."__measure__0") AS "__measure__2"
> FROM (
> SELECT SUM("CALCS"."NUM1") AS "__measure__0"
> FROM "TDVT"."CALCS" "CALCS"
> GROUP BY "CALCS"."STR1"
> ) "t1"
> HAVING (COUNT(1) > 0)
> ) "t2"
> LIMIT 500
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)