[ 
https://issues.apache.org/jira/browse/KYLIN-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317869#comment-17317869
 ] 

ASF subversion and git services commented on KYLIN-4945:
--------------------------------------------------------

Commit dfd012f45b9740ae0598041d7f1326e3a58c0da7 in kylin's branch 
refs/heads/kylin-on-parquet-v2 from zhengshengjun
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=dfd012f ]

KYLIN-4945 repartition encoded dataset to avoid data skew caused by single 
column


> Repartition encoded dataset to avoid data skew caused by a single column
> ------------------------------------------------------------------------
>
>                 Key: KYLIN-4945
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4945
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v4.0.0-beta
>            Reporter: ShengJun Zheng
>            Assignee: ShengJun Zheng
>            Priority: Minor
>             Fix For: v4.0.0-GA
>
>         Attachments: image-2021-03-24-17-37-57-505.png
>
>
> In Kylin 4, the global dictionary is split into several buckets. To encode 
> the flat source table more efficiently, the source dataset is repartitioned 
> into as many partitions as the dictionary has buckets. This sometimes has a 
> side effect, because repartitioning by a single column is more likely to 
> cause data skew.
> We have a case in which the column of a topN/count_distinct measure is 
> seriously skewed. The dataset becomes skewed after the repartition, so one 
> task takes the majority of the time in the first-layer cuboid build job.
> !image-2021-03-24-17-37-57-505.png!
> To improve this case, we added a step that repartitions the encoded dataset 
> by all RowKey columns, which reduced the first-layer build time from 20 min 
> to 4 min.
>  
>  
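
The effect described above can be illustrated with a small simulation. This
is only a sketch in plain Python, not Kylin's or Spark's code: the hash
partitioner, the table layout (user_id, dim_a, dim_b), the 80% hot key and
the partition count of 8 are all assumptions made up for illustration. In
Spark, the analogous operation would be a hash repartition on one column
versus on several columns.

```python
import hashlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic hash partitioner (a stand-in for a hash repartition)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_sizes(rows, key_fn, num_partitions):
    """Return the number of rows that land in each partition."""
    counts = Counter(partition_of(key_fn(r), num_partitions) for r in rows)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical flat table: (user_id, dim_a, dim_b).
# 80% of the rows share user_id 0, imitating a skewed measure column.
rows = [(0 if i % 10 < 8 else i, i % 97, i % 89) for i in range(10_000)]
NUM_PARTITIONS = 8  # assumed; stands in for the dictionary bucket count

# Repartition by the single skewed column: every user_id 0 row
# hashes to the same partition.
by_single = partition_sizes(rows, lambda r: r[0], NUM_PARTITIONS)

# Repartition by all columns of the row (imitating "all RowKey columns"):
# the hot user_id is spread out by the other columns.
by_all = partition_sizes(rows, lambda r: r, NUM_PARTITIONS)

print("single column, largest partition:", max(by_single))
print("all columns,   largest partition:", max(by_all))
```

With the single-column key, the largest partition holds at least the 8,000
rows of the hot user_id by construction, so one task does most of the work.
With the composite key, the same rows are spread across all partitions,
which is the kind of rebalancing the added repartition step aims for.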



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
