[
https://issues.apache.org/jira/browse/KYLIN-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728555#comment-16728555
]
ASF subversion and git services commented on KYLIN-3540:
--------------------------------------------------------
Commit 8cfe32cf4f61d218439e57e134ccc3413aa98f89 in kylin's branch
refs/heads/master from kyotoYaho
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=8cfe32c ]
KYLIN-3540 move queryService of CubeController to CubeService
> Improve Mandatory Cuboid Recommendation Algorithm
> -------------------------------------------------
>
> Key: KYLIN-3540
> URL: https://issues.apache.org/jira/browse/KYLIN-3540
> Project: Kylin
> Issue Type: Improvement
> Reporter: Zhong Yanghong
> Assignee: Zhong Yanghong
> Priority: Major
> Fix For: v2.6.0
>
>
> Previously to add cuboids which are not prebuilt, the cube planner turns to
> mandatory cuboids which are selected if its rollup row count is above some
> threshold. There are two shortcomings:
> * The way to estimate the rollup row count is not good
> * It's hard to determine the threshold of rollup row count for recommending
> mandatory cuboids
> bq. {color:#f79232}The improved way to estimate the rollup row count is as
> follows:{color}
> Current criteria to recommend mandatory cuboids is based on the average
> rollup count collected with query metrics. There's a disadvantage. An example
> is as follows:
> Cuboid (A,B) has 1000 rows, prebuilt; Cuboid (B) has 10 rows, not prebuilt;
> The ground truth for the rollup count from Cuboid (A,B) to Cuboid (B) is
> {code}
> Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
> {code}
> Suppose the values of B are evenly distributed over the rows of Cuboid (A,B).
> Then for each value of B, Cuboid (A,B) holds 1000 / 10 = 100 rows.
> Now for the SQL query
> {code}
> select B, count(*)
> from T
> where B = 'e1'
> group by B
> {code}
> the rollup count estimated by the current algorithm will be
> {code}
> Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
> {code}
> which is much smaller than the true value 990, because most of the parent
> cuboid's rows are removed by the filter.
> It's better to calculate the rollup rate first and then multiply it by the
> parent cuboid row count to estimate the rollup count. The refined formula is
> as follows:
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) =
> 1000-1000*1/100 = 990
> {code}
> For another SQL query
> {code}
> select count(*)
> from T
> where B in ('e1','e2')
> {code}
> the rollup count estimated by the current algorithm will be
> {code}
> Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
> {code}
> The rollup count estimated by the refined algorithm will be
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) =
> 1000-1000*1/(100*2) = 995
> {code}
> In summary, the refined algorithm is much less influenced by filters in the
> SQL.
> bq. {color:#f79232}Don't recommend mandatory cuboids & don't need the
> threshold
> {color}
> Previously, the reason to recommend mandatory cuboids was that they were not
> prebuilt and their row count statistics were unknown, so the cube planner
> algorithm could not be applied to them. Now, with the improved way of
> estimating the rollup row count, we can better estimate the row count
> statistics for cuboids that are not prebuilt. The cost-based cube planner
> algorithm can then decide which cuboids to build, and the threshold is no
> longer needed.
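The two estimates described above can be sketched as follows (a minimal Python illustration of the formulas in this description; the function and constant names are hypothetical, not Kylin's actual API):

```python
def current_rollup_estimate(scanned_rows, returned_rows):
    # Current algorithm: rollup count = rows scanned minus rows returned.
    # Heavy filtering shrinks scanned_rows, which shrinks the estimate.
    return scanned_rows - returned_rows

def refined_rollup_estimate(parent_rows, scanned_rows, returned_rows):
    # Refined algorithm: compute the rollup rate on the scanned slice,
    # then scale the full parent cuboid row count by that rate.
    return parent_rows - parent_rows * returned_rows / scanned_rows

PARENT_ROWS = 1000  # Cuboid (A,B), prebuilt

# Query 1: select B, count(*) from T where B = 'e1' group by B
# scans 100 rows of Cuboid (A,B) and returns 1 row.
print(current_rollup_estimate(100, 1))               # -> 99
print(refined_rollup_estimate(PARENT_ROWS, 100, 1))  # -> 990.0

# Query 2: select count(*) from T where B in ('e1','e2')
# scans 200 rows and returns 1 row.
print(current_rollup_estimate(200, 1))               # -> 199
print(refined_rollup_estimate(PARENT_ROWS, 200, 1))  # -> 995.0
```

Both filtered queries recover an estimate close to the ground-truth rollup count of 990 under the refined formula, while the current formula is dragged down by the filter.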
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)