[ 
https://issues.apache.org/jira/browse/KYLIN-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370216#comment-17370216
 ] 

ASF subversion and git services commented on KYLIN-5011:
--------------------------------------------------------

Commit 914b97f5cf2347030525140038d060178b93f955 in kylin's branch 
refs/heads/kylin-on-parquet-v2 from zhengshengjun
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=914b97f ]

KYLIN-5011 Detect and scatter skewed data in dict encoding step (#1662)

Co-authored-by: Xiaoxiang Yu <[email protected]>

> Detect and scatter skewed data in dict encoding step
> ----------------------------------------------------
>
>                 Key: KYLIN-5011
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5011
>             Project: Kylin
>          Issue Type: New Feature
>          Components: Job Engine
>    Affects Versions: v4.0.0-beta
>            Reporter: ShengJun Zheng
>            Assignee: ShengJun Zheng
>            Priority: Major
>             Fix For: v4.0.0
>
>         Attachments: image-2021-06-15-10-54-19-419.png
>
>
> In KYLIN4, each dictionary is hashed into several buckets, and column data is 
> repartitioned so that the number of partitions equals the number of buckets. 
> Each encoding task then only needs to load one dictionary bucket, which 
> accelerates the encoding step. 
> Recently this optimization has caused us trouble when the data is skewed. In 
> some of our cases, the repartition step during encoding cannot finish at all. 
> The same build works fine in KYLIN3, because each Spark task loads the whole 
> dictionary of a column and encodes the column values to int values, so no 
> repartition step is needed in KYLIN3.
> We solve this by:
>  # sampling the source data to detect skewed values
>  # building a dictionary for the skewed values
>  # customizing a repartition function that scatters skewed values to random 
> partitions
>  # encoding with both the skewed-value dictionary and the dictionary bucket 
> loaded within each partition
> After this improvement, the build time of some of our cubes dropped from 
> 190 min to 30 min.
> !image-2021-06-15-10-54-19-419.png!
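
The four steps in the description can be sketched in plain Python. This is 
only an illustration of the idea, not the code from the commit: the real 
change runs inside Kylin's Spark build job, and every name below 
(detect_skewed, bucket_of, assign_partition, encode), the sampling fraction, 
the 20% threshold and the CRC32 bucket hash are assumptions made for the 
example.

```python
import random
import zlib
from collections import Counter


def detect_skewed(values, sample_fraction=0.1, threshold=0.2, seed=0):
    """Step 1: sample the data and return values whose share of the sample exceeds threshold."""
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < sample_fraction]
    if not sample:
        return set()
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > threshold}


def bucket_of(value, num_buckets):
    """Stable hash bucket of a value (hypothetical stand-in for the dictionary bucket hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def assign_partition(value, skewed, num_buckets, rng):
    """Step 3: skewed values go to a random partition, all others to their hash bucket."""
    if value in skewed:
        return rng.randrange(num_buckets)
    return bucket_of(value, num_buckets)


def encode(values, global_dict, num_buckets=4, sample_fraction=0.1, threshold=0.2, seed=0):
    """Run all four steps and return (partitions, list of (value, code) pairs)."""
    skewed = detect_skewed(values, sample_fraction, threshold, seed)
    # Step 2: a small dictionary holding only the skewed values, visible to every task.
    skewed_dict = {v: global_dict[v] for v in skewed}

    rng = random.Random(seed)
    partitions = [[] for _ in range(num_buckets)]
    for v in values:
        partitions[assign_partition(v, skewed, num_buckets, rng)].append(v)

    # Step 4: each "task" loads only its own dictionary bucket plus the skewed dictionary.
    encoded = []
    for p, part in enumerate(partitions):
        bucket_dict = {v: c for v, c in global_dict.items() if bucket_of(v, num_buckets) == p}
        for v in part:
            code = skewed_dict[v] if v in skewed_dict else bucket_dict[v]
            encoded.append((v, code))
    return partitions, encoded
```

With pure hash partitioning, every occurrence of a hot value lands in the 
same partition. Scattering it at random is safe here only because the 
skewed-value dictionary is available in every partition, so a scattered value 
never has to be looked up in a bucket it does not belong to.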



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
