[ https://issues.apache.org/jira/browse/KYLIN-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shao Feng Shi reassigned KYLIN-4945:
------------------------------------
Assignee: ShengJun Zheng
> Repartition encoded dataset to avoid data skew caused by a single column
> ------------------------------------------------------------------------
>
> Key: KYLIN-4945
> URL: https://issues.apache.org/jira/browse/KYLIN-4945
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Affects Versions: v4.0.0-beta
> Reporter: ShengJun Zheng
> Assignee: ShengJun Zheng
> Priority: Minor
> Fix For: v4.0.0-GA
>
> Attachments: image-2021-03-24-17-37-57-505.png
>
>
> In Kylin 4, the global dictionary is split into several buckets. To encode
> the flat source table more efficiently, the source dataset is repartitioned
> into as many partitions as the dictionary has buckets. This
> sometimes has a side effect, because repartitioning by a single column is
> more likely to cause data skew.
> We have a case in which a topN/count_distinct measure has serious data skew.
> The dataset becomes skewed after the repartition, so a single task takes the
> majority of the time in the first layer's cuboid build job.
> !image-2021-03-24-17-37-57-505.png!
> To improve this case, we added a step that repartitions the encoded dataset
> by all RowKey columns, which reduced the first layer's build time from
> 20 min to 4 min.
>
>
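The skew described in the issue can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Kylin's or Spark's actual code: the column names (user_id, dt, city), the row count, the bucket count, and the use of an MD5-based hash as a stand-in for the engine's hash partitioner are all hypothetical. It compares the largest partition when hashing on one skewed column against hashing on several columns together.

```python
import hashlib
from collections import Counter


def partition_sizes(rows, key_columns, num_partitions):
    """Return the number of rows per partition when hashing on key_columns."""
    sizes = Counter()
    for row in rows:
        key = "|".join(str(row[c]) for c in key_columns)
        # MD5 is only a stable, well-mixed stand-in for a real hash partitioner.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        sizes[int.from_bytes(digest[:4], "big") % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical flat table: 90% of rows share a single value in user_id,
# mimicking a heavily skewed topN/count_distinct column.
rows = [
    {
        "user_id": "hot" if i % 10 else f"u{i}",
        "dt": f"d{i % 30}",
        "city": f"c{i % 97}",
    }
    for i in range(10_000)
]

NUM_BUCKETS = 8  # assumed to match the dictionary's bucket count

# Repartition by the single encoded column: every "hot" row lands together.
by_single = partition_sizes(rows, ["user_id"], NUM_BUCKETS)

# Repartition by all (hypothetical) RowKey columns: the hot value is spread out.
by_rowkey = partition_sizes(rows, ["user_id", "dt", "city"], NUM_BUCKETS)

print("single column, largest partition:", max(by_single))
print("all RowKey columns, largest partition:", max(by_rowkey))
```

With a single partitioning column, all rows that share the hot value must hash to the same partition, so one task ends up with most of the data. Hashing on the combination of columns breaks that group into many distinct keys, which is the effect the extra repartition step relies on.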
--
This message was sent by Atlassian Jira
(v8.3.4#803005)