[jira] [Comment Edited] (KYLIN-5787) Use t-digest as spark percentile_approx function

pengfei.zhan (Jira) Tue, 09 Apr 2024 05:18:03 -0700


    [ 
https://issues.apache.org/jira/browse/KYLIN-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835371#comment-17835371
 ]


pengfei.zhan edited comment on KYLIN-5787 at 4/9/24 12:17 PM:
--------------------------------------------------------------

h1. The old behavior

 
|| ||*percentile*||*percentile_approx*||
|Precomputation|t-digest|t-digest|
|Runtime computation|QuantileSummaries|QuantileSummaries|
|Pushdown / spark-sql|Sort and take the exact value|QuantileSummaries|

 
h1. Design

Add a configuration item, "kylin.query.percentile-approx-algorithm". Its 
default value is null, which keeps the current behavior unchanged. Project-level 
settings are not supported, and KYLIN must be restarted for a change to take 
effect.

When the option is set to "t-digest", the behavior is as follows:
 
|| ||*percentile*||*percentile_approx*||
|Precomputation|t-digest|t-digest|
|Runtime computation|t-digest|t-digest|
|Pushdown|Sort and take the exact value|t-digest|
|spark-sql|Sort and take the exact value|QuantileSummaries|

"Runtime computation" means that the query needs extra aggregation on top of 
the layout (also called a cuboid).
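
The two tables above can be read as a lookup from (execution path, function, configured value) to the algorithm that produces the result. The following is only a minimal sketch of that reading, not KYLIN code: the function name, the dictionary names and the path labels are hypothetical, and rejecting values other than null and "t-digest" is an assumption, since the comment describes only those two settings.

```python
from typing import Optional

# Hypothetical labels for illustration; none of these are KYLIN identifiers.
T_DIGEST = "t-digest"
QUANTILE_SUMMARIES = "QuantileSummaries"
EXACT = "sort and take the exact value"

# Behavior when kylin.query.percentile-approx-algorithm is unset (null).
DEFAULT_BEHAVIOR = {
    ("precomputation", "percentile"): T_DIGEST,
    ("precomputation", "percentile_approx"): T_DIGEST,
    ("runtime computation", "percentile"): QUANTILE_SUMMARIES,
    ("runtime computation", "percentile_approx"): QUANTILE_SUMMARIES,
    ("pushdown", "percentile"): EXACT,
    ("pushdown", "percentile_approx"): QUANTILE_SUMMARIES,
    ("spark-sql", "percentile"): EXACT,
    ("spark-sql", "percentile_approx"): QUANTILE_SUMMARIES,
}

# Cells that change when the option is set, e.g. with a line such as
#   kylin.query.percentile-approx-algorithm=t-digest
# in the KYLIN configuration (exact file location is an assumption).
T_DIGEST_OVERRIDES = {
    ("runtime computation", "percentile"): T_DIGEST,
    ("runtime computation", "percentile_approx"): T_DIGEST,
    ("pushdown", "percentile_approx"): T_DIGEST,
}


def algorithm_for(path: str, function: str, configured: Optional[str] = None) -> str:
    """Return the algorithm listed in the tables for one cell."""
    if configured not in (None, T_DIGEST):
        # Assumption of this sketch: only null and "t-digest" are described.
        raise ValueError(f"unsupported value: {configured!r}")
    key = (path, function)
    if configured == T_DIGEST and key in T_DIGEST_OVERRIDES:
        return T_DIGEST_OVERRIDES[key]
    return DEFAULT_BEHAVIOR[key]
```

For example, `algorithm_for("pushdown", "percentile_approx", "t-digest")` returns "t-digest", while the same call for "spark-sql" still returns "QuantileSummaries", which matches the last two rows of the second table.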

 

For more information, please refer to: 
https://cn.kyligence.io/resources/kyligence-public-seminar-190403/



> Use t-digest as spark percentile_approx function
> ------------------------------------------------
>
>                 Key: KYLIN-5787
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5787
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine, Query Engine
>    Affects Versions: 5.0-beta
>            Reporter: pengfei.zhan
>            Assignee: pengfei.zhan
>            Priority: Critical
>             Fix For: 5.0-beta
>
>
> The underlying implementation of the percentile_approx function in KYLIN is 
> the open-source t-digest.
> The underlying implementation of the percentile_approx function in Spark is 
> Spark's own PercentileDigest (based on QuantileSummaries).
> The different implementations lead to different results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
