[ 
https://issues.apache.org/jira/browse/KYLIN-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shao Feng Shi reassigned KYLIN-4945:
------------------------------------

    Assignee: ShengJun Zheng

> Repartition encoded dataset to avoid data skew caused by a single column
> ------------------------------------------------------------------------
>
>                 Key: KYLIN-4945
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4945
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v4.0.0-beta
>            Reporter: ShengJun Zheng
>            Assignee: ShengJun Zheng
>            Priority: Minor
>             Fix For: v4.0.0-GA
>
>         Attachments: image-2021-03-24-17-37-57-505.png
>
>
> In Kylin 4, the global dictionary is split into several buckets. To encode 
> the flat source table more efficiently, the source dataset is repartitioned 
> into as many partitions as the dictionary has buckets. This sometimes has a 
> side effect, because repartitioning by a single column is more likely to 
> cause data skew.
> We have a case where the column behind a topN/count_distinct measure is 
> seriously skewed. The dataset becomes skewed after the repartition, so a 
> single task takes the majority of the time in the first layer's cuboid 
> build job.
> !image-2021-03-24-17-37-57-505.png!
> To improve this case, we added a step that repartitions the encoded dataset 
> by all RowKey columns, which reduced the first layer's build time from 
> 20 min to 4 min.
>  
>  
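The effect described above can be illustrated with a minimal sketch. It is not Kylin's or Spark's actual code; it only simulates hash partitioning in plain Python. The column names (user_id, dt, city), the 80% skew on user_id, and the partition count of 10 are all made-up assumptions for illustration.

```python
from collections import Counter
import random


def partition_sizes(rows, key_columns, num_partitions):
    """Return the number of rows per partition when rows are assigned
    by hashing the values of key_columns (a stand-in for a hash repartition)."""
    sizes = Counter()
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        sizes[hash(key) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


def make_rows(n, seed=0):
    """Build a hypothetical encoded dataset in which one column is heavily skewed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: about 80% of rows share one user_id; dt and city are evenly spread.
        user_id = 0 if rng.random() < 0.8 else rng.randrange(1, 10_000)
        rows.append({"user_id": user_id,
                     "dt": rng.randrange(365),
                     "city": rng.randrange(500)})
    return rows


rows = make_rows(100_000)
num_partitions = 10  # stands in for the dictionary's bucket count

# Partitioning by the skewed column alone sends every hot-key row to one partition.
single = partition_sizes(rows, ["user_id"], num_partitions)
# Partitioning by all (hypothetical) RowKey columns spreads the hot key out.
all_keys = partition_sizes(rows, ["user_id", "dt", "city"], num_partitions)

print("largest partition, one column :", max(single))
print("largest partition, all columns:", max(all_keys))
```

In this sketch the largest partition is a rough proxy for the slowest task, since each partition is processed by one task and the stage finishes only when the largest one does.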



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
