[jira] [Commented] (KYLIN-5011) Detect and scatter skewed data in dict encoding step

ASF subversion and git services (Jira) Tue, 10 Aug 2021 18:53:06 -0700


    [ 
https://issues.apache.org/jira/browse/KYLIN-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396972#comment-17396972
 ]


ASF subversion and git services commented on KYLIN-5011:
--------------------------------------------------------

Commit 6416ea0f426ccfa60bed707693bccc0e7676eac2 in kylin's branch 
refs/heads/main from Shengjun Zheng
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=6416ea0 ]

KYLIN-5011 Detect and scatter skewed data in dict encoding step


> Detect and scatter skewed data in dict encoding step
> ----------------------------------------------------
>
>                 Key: KYLIN-5011
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5011
>             Project: Kylin
>          Issue Type: New Feature
>          Components: Job Engine
>    Affects Versions: v4.0.0-beta
>            Reporter: Shengjun Zheng
>            Assignee: Shengjun Zheng
>            Priority: Major
>             Fix For: v4.0.0
>
>         Attachments: image-2021-06-15-10-54-19-419.png
>
>
> In Kylin 4, each column's dictionary is hashed into several buckets, and the 
> column data is repartitioned so that the number of partitions equals the 
> number of buckets. Each encoding task then only needs to load one dictionary 
> bucket, which speeds up the encoding step. 
> Recently this improvement has caused us trouble when the data is skewed. In 
> some of our cases, the repartition step during encoding cannot finish at 
> all. The same builds work fine in Kylin 3, where each Spark task loads the 
> whole dictionary of a column and encodes the column values to int values, so 
> no repartition step is needed.
> We solve this by:
>  # sampling the source data to detect skewed values
>  # building a dictionary for the skewed values
>  # customizing a repartition function that scatters skewed values to random 
> partitions
>  # encoding with both the skewed-value dictionary and the dictionary bucket 
> loaded within each partition
> After this improvement, the build time of some of our cubes dropped from 
> 190 min to 30 min.
> !image-2021-06-15-10-54-19-419.png!
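
The four steps in the description above can be sketched roughly as follows. This is a minimal, single-process illustration and not the code from the commit: plain Python lists stand in for Spark partitions, and every name here (bucket_of, detect_skewed, partition_of, encode, NUM_BUCKETS, SKEW_THRESHOLD), the CRC32 hash, and the threshold value are assumptions made for the example.

```python
import random
import zlib
from collections import Counter

NUM_BUCKETS = 4        # assumed number of dictionary buckets / partitions
SKEW_THRESHOLD = 0.2   # assumed: a value is skewed if it exceeds 20% of the sample


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Stable hash of a value to a dictionary bucket (CRC32 is an assumption)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def detect_skewed(rows, sample_size, threshold, rng):
    """Step 1: sample the source data and return the over-represented values."""
    sample = rng.sample(rows, min(sample_size, len(rows)))
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > threshold}


def partition_of(value, skewed, num_buckets, rng):
    """Step 3: skewed values go to a random partition, others to their bucket."""
    if value in skewed:
        return rng.randrange(num_buckets)
    return bucket_of(value, num_buckets)


def encode(rows, dictionary, skewed, num_buckets, rng):
    """Steps 2 and 4: build the skewed-value dictionary, repartition, encode."""
    # Step 2: a small dictionary holding only the skewed values,
    # available to every encoding task.
    skewed_dict = {v: dictionary[v] for v in skewed}

    # Split the full dictionary into hash buckets, one per partition.
    bucket_dicts = [dict() for _ in range(num_buckets)]
    for v, code in dictionary.items():
        bucket_dicts[bucket_of(v, num_buckets)][v] = code

    # Repartition the column values with the skew-aware function.
    partitions = [[] for _ in range(num_buckets)]
    for v in rows:
        partitions[partition_of(v, skewed, num_buckets, rng)].append(v)

    # Step 4: each task looks in the skewed dictionary first,
    # then in the single bucket that belongs to its partition.
    encoded = []
    for p, part in enumerate(partitions):
        local = bucket_dicts[p]
        for v in part:
            code = skewed_dict[v] if v in skewed_dict else local[v]
            encoded.append((v, code))
    return partitions, encoded
```

With a column in which one value makes up most of the rows, plain hash partitioning would send all of those rows to a single partition. In this sketch they are spread over all partitions, while the remaining values still land in the partition whose dictionary bucket contains them.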



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
