[
https://issues.apache.org/jira/browse/KYLIN-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396972#comment-17396972
]
ASF subversion and git services commented on KYLIN-5011:
--------------------------------------------------------
Commit 6416ea0f426ccfa60bed707693bccc0e7676eac2 in kylin's branch
refs/heads/main from Shengjun Zheng
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=6416ea0 ]
KYLIN-5011 Detect and scatter skewed data in dict encoding step
> Detect and scatter skewed data in dict encoding step
> ----------------------------------------------------
>
> Key: KYLIN-5011
> URL: https://issues.apache.org/jira/browse/KYLIN-5011
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine
> Affects Versions: v4.0.0-beta
> Reporter: Shengjun Zheng
> Assignee: Shengjun Zheng
> Priority: Major
> Fix For: v4.0.0
>
> Attachments: image-2021-06-15-10-54-19-419.png
>
>
> In KYLIN4, dictionaries are hashed into several buckets, and column data is
> repartitioned so that the number of partitions equals the number of buckets.
> Each encoding task then only has to load one dictionary bucket, which
> accelerates the encoding step.
> Recently this improvement has caused us trouble when the data is skewed. In
> some of our cases, the repartition step during encoding cannot finish at all.
> The same builds work fine in KYLIN3, because each Spark task loads the whole
> dictionary of a column and encodes the column values to int values, so no
> repartition step is needed.
> We solve this by:
> # sample from source data and detect skewed data
> # build skewed data's dictionary
> # customize a repartition function to scatter skewed data to random
> partitions
> # do encoding with both skewed dictionary and dictionary loaded within each
> partition
> After this improvement, the build time of some of our cubes dropped from
> 190 min to 30 min.
> !image-2021-06-15-10-54-19-419.png!
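
The four steps listed in the description can be sketched in plain Python as
follows. This is only an illustration under stated assumptions, not the code
from the commit: the names (detect_skewed, scatter_partition, ...), the bucket
count, the 20% skew threshold and the use of crc32 as the hash are all
hypothetical, and in-memory lists stand in for Spark partitions.

```python
import random
import zlib
from collections import Counter

NUM_BUCKETS = 4        # hypothetical bucket / partition count
SKEW_THRESHOLD = 0.2   # hypothetical: "skewed" means more than 20% of the sample


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Deterministic hash bucket for a value (crc32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def detect_skewed(values, sample_size, threshold, rng):
    """Step 1: sample the column and keep values whose sample share exceeds threshold."""
    sample = rng.sample(values, min(sample_size, len(values)))
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > threshold}


def build_dictionaries(values, skewed):
    """Step 2: a small dictionary for skewed values, plus one dictionary per bucket."""
    ids = {v: i for i, v in enumerate(sorted(set(values)))}
    skewed_dict = {v: ids[v] for v in skewed}
    bucket_dicts = [dict() for _ in range(NUM_BUCKETS)]
    for v, i in ids.items():
        bucket_dicts[bucket_of(v)][v] = i
    return ids, skewed_dict, bucket_dicts


def scatter_partition(value, skewed, rng):
    """Step 3: skewed values go to a random partition, the rest to their hash bucket."""
    if value in skewed:
        return rng.randrange(NUM_BUCKETS)
    return bucket_of(value)


def encode(values, skewed, skewed_dict, bucket_dicts, rng):
    """Step 4: each partition encodes with the skewed dictionary and its own bucket."""
    partitions = [[] for _ in range(NUM_BUCKETS)]
    for v in values:
        partitions[scatter_partition(v, skewed, rng)].append(v)
    encoded = []
    for p, rows in enumerate(partitions):
        local = bucket_dicts[p]  # the only bucket this "task" loads
        for v in rows:
            encoded.append((v, skewed_dict[v] if v in skewed_dict else local[v]))
    return partitions, encoded


# Toy column: one value accounts for 90% of the rows.
rng = random.Random(0)
column = ["hot"] * 900 + [f"v{i}" for i in range(100)]
skewed = detect_skewed(column, 200, SKEW_THRESHOLD, rng)
ids, skewed_dict, bucket_dicts = build_dictionaries(column, skewed)
partitions, encoded = encode(column, skewed, skewed_dict, bucket_dicts, rng)
print(sorted(skewed), [len(p) for p in partitions])
```

With pure hash partitioning, all 900 "hot" rows would land in a single
partition. In the sketch they are spread over all partitions, and they can
still be encoded there because every task carries the small skewed dictionary
in addition to its own bucket.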
--
This message was sent by Atlassian Jira
(v8.3.4#803005)