[ https://issues.apache.org/jira/browse/KYLIN-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724653#comment-16724653 ]

ASF GitHub Bot commented on KYLIN-3540:
---------------------------------------

kyotoYaho opened a new pull request #407: KYLIN-3540 estimate the row counts of 
source cuboids which are not built & remove mandatory cuboids recommendation
URL: https://github.com/apache/kylin/pull/407

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Improve Mandatory Cuboid Recommendation Algorithm
> -------------------------------------------------
>
>                 Key: KYLIN-3540
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3540
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>             Fix For: v2.6.0
>
>
> The current criterion for recommending mandatory cuboids is based on the 
> average rollup count collected from query metrics. This has a disadvantage, 
> illustrated by the following example:
> Cuboid (A,B) has 1000 rows and is prebuilt; Cuboid (B) has 10 rows and is 
> not prebuilt. The ground truth for the rollup count from Cuboid (A,B) to 
> Cuboid (B) is
> {code}
> Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
> {code}
> Suppose the values of B are evenly distributed over the rows of Cuboid (A,B). 
> Then for each value of B, the matching row count in Cuboid (A,B) is 
> 1000 / 10 = 100.
> Now consider the following SQL query:
> {code}
> select B, count(*)
> from T
> where B = 'e1'
> group by B
> {code}
> The rollup count estimated by the current algorithm is
> {code}
> Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
> {code}
> which is much smaller than the ground truth of 990, because most of the 
> parent cuboid's rows have already been removed by the filter before the 
> rollup is measured.
> It is better to calculate the rollup rate first and then multiply it by the 
> parent cuboid's row count to estimate the rollup count. The refined formula 
> is as follows:
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) = 
> 1000-1000*1/100 = 990
> {code}
> Consider another SQL query:
> {code}
> select count(*)
> from T
> where B in ('e1', 'e2')
> {code}
> The rollup count estimated by the current algorithm is
> {code}
> Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
> {code}
> The rollup count estimated by the refined algorithm is
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) = 
> 1000-1000*1/(100*2) = 995
> {code}
> In summary, the refined algorithm is much less influenced by filters in the 
> SQL query, so its rollup estimates stay close to the ground truth.
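The two worked examples above can be reproduced with a short sketch. All names here are hypothetical illustrations of the formulas in the description; this is not Kylin code:

```python
# Scenario from the description: cuboid (A,B) has 1000 rows and is prebuilt;
# cuboid (B) has 10 distinct values, so each value of B matches ~100 rows
# of (A,B).
PARENT_ROWS = 1000  # row count of the prebuilt parent cuboid (A,B)


def rollup_current(filtered_parent_rows, returned_rows):
    """Current algorithm: rows read from the parent cuboid (after the
    query's filters) minus rows returned by the query."""
    return filtered_parent_rows - returned_rows


def rollup_refined(parent_rows, filtered_parent_rows, returned_rows):
    """Refined algorithm: compute the rollup *rate* on the filtered data,
    then scale the full parent cuboid row count by it."""
    return parent_rows - parent_rows * returned_rows / filtered_parent_rows


# Query 1: select B, count(*) from T where B = 'e1' group by B
# reads 100 rows of (A,B) and returns 1 row.
print(rollup_current(100, 1))                # 99   (far below the true 990)
print(rollup_refined(PARENT_ROWS, 100, 1))   # 990.0 (matches ground truth)

# Query 2: select count(*) from T where B in ('e1', 'e2')
# reads 200 rows of (A,B) and returns 1 row.
print(rollup_current(200, 1))                # 199
print(rollup_refined(PARENT_ROWS, 200, 1))   # 995.0
```

The refined estimate depends only on the fraction of filtered rows that survive the rollup, so shrinking the scan with a filter no longer shrinks the estimate.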



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
