[
https://issues.apache.org/jira/browse/KYLIN-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaoxiang Yu resolved KYLIN-5011.
---------------------------------
Resolution: Fixed
> Detect and scatter skewed data in dict encoding step
> ----------------------------------------------------
>
> Key: KYLIN-5011
> URL: https://issues.apache.org/jira/browse/KYLIN-5011
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine
> Affects Versions: v4.0.0-beta
> Reporter: Shengjun Zheng
> Assignee: Shengjun Zheng
> Priority: Major
> Fix For: v4.0.0
>
> Attachments: image-2021-06-15-10-54-19-419.png
>
>
> In KYLIN4, each dictionary is hashed into several buckets, and column data is
> repartitioned so that the number of partitions equals the number of buckets.
> Each encoding task then only has to load one dictionary bucket, which
> accelerates the encoding step.
> Recently this improvement has caused us trouble when the data is skewed. In
> some of our cases, the repartition step during encoding cannot finish at all.
> The same builds work fine in KYLIN3, because each Spark task loads the whole
> dictionary of a column and encodes the column values to int values, so KYLIN3
> needs no repartition step.
> We solve this by:
> # sample the source data and detect skewed values
> # build a dictionary for the skewed values
> # customize a repartition function that scatters skewed values to random
> partitions
> # encode with both the skewed-value dictionary and the dictionary bucket
> loaded within each partition
> After this improvement, the build time of some of our cubes dropped from
> 190 min to 30 min.
> !image-2021-06-15-10-54-19-419.png!
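The four steps in the description above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the code from the actual patch: the bucket count, the skew threshold, the crc32 hash, the global-id assignment, and every function name (`bucket_of`, `detect_skewed`, `build_dicts`, `partition_of`, `encode`) are hypothetical.

```python
import random
import zlib
from collections import Counter

NUM_BUCKETS = 4        # assumed number of dictionary buckets (= partitions)
SKEW_THRESHOLD = 0.2   # assumed: a value is skewed above this share of the sample


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Deterministic dictionary bucket of a value (crc32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def detect_skewed(values, sample_size, threshold, rng):
    """Step 1: sample the column and return values that dominate the sample."""
    sample = rng.sample(values, min(sample_size, len(values)))
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > threshold}


def build_dicts(values, skewed):
    """Step 2: build a small dictionary for skewed values plus bucketed dictionaries."""
    # Assumption: ids are simply positions in the sorted distinct values.
    global_ids = {v: i for i, v in enumerate(sorted(set(values)))}
    skewed_dict = {v: global_ids[v] for v in skewed}
    buckets = [dict() for _ in range(NUM_BUCKETS)]
    for v, i in global_ids.items():
        buckets[bucket_of(v)][v] = i
    return skewed_dict, buckets


def partition_of(value, skewed_dict, rng):
    """Step 3: scatter skewed values randomly, route the rest to their bucket."""
    if value in skewed_dict:
        return rng.randrange(NUM_BUCKETS)
    return bucket_of(value)


def encode(values, skewed_dict, buckets, rng):
    """Step 4: each partition encodes with the skewed dictionary and its own bucket."""
    partitions = [[] for _ in range(NUM_BUCKETS)]
    for v in values:
        partitions[partition_of(v, skewed_dict, rng)].append(v)
    encoded = []
    for p, rows in enumerate(partitions):
        local = buckets[p]  # the only bucket this "task" would load
        for v in rows:
            # Skewed values may land in any partition, so look them up first
            # in the small skewed dictionary that every task carries.
            code = skewed_dict[v] if v in skewed_dict else local[v]
            encoded.append((v, code))
    return partitions, encoded
```

With a column in which one value makes up 90% of the rows, `detect_skewed` flags that value, `partition_of` spreads its rows over all partitions instead of a single one, and every row is still encoded to the same id it would get from the bucketed dictionary alone.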
--
This message was sent by Atlassian Jira
(v8.3.4#803005)