[ https://issues.apache.org/jira/browse/KYLIN-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoxiang Yu resolved KYLIN-5011.
---------------------------------
    Resolution: Fixed

> Detect and scatter skewed data in dict encoding step
> ----------------------------------------------------
>
>                 Key: KYLIN-5011
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5011
>             Project: Kylin
>          Issue Type: New Feature
>          Components: Job Engine
>    Affects Versions: v4.0.0-beta
>            Reporter: Shengjun Zheng
>            Assignee: Shengjun Zheng
>            Priority: Major
>             Fix For: v4.0.0
>
>         Attachments: image-2021-06-15-10-54-19-419.png
>
>
> In KYLIN4, each dictionary is hashed into several buckets, and the column
> data is repartitioned so that the number of partitions equals the number of
> buckets. Each encoding task then only needs to load one dictionary bucket,
> which accelerates the encoding step.
> Recently this improvement has caused us trouble when the data is skewed. In
> some of our cases, the repartition step during encoding cannot finish at
> all. The same data works fine in KYLIN3, because each Spark task loads the
> whole dictionary of a column and encodes the column values to int values,
> so KYLIN3 needs no repartition step.
> We solve this by:
> # sampling the source data and detecting skewed values
> # building a dictionary for the skewed values
> # customizing a repartition function to scatter skewed values to random
> partitions
> # encoding with both the skewed dictionary and the dictionary bucket loaded
> within each partition
> After this improvement, the build time of some of our cubes dropped from
> 190 min to 30 min.
> !image-2021-06-15-10-54-19-419.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
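
The four steps in the issue description can be illustrated with a small, Spark-free sketch. This is not Kylin's implementation. Every name here (`detect_skewed`, `scatter_partition`, `encode_column`), the CRC32 bucketing, and the 20% skew threshold are assumptions made for illustration only, and plain Python lists stand in for Spark partitions.

```python
import random
import zlib
from collections import Counter


def detect_skewed(values, sample_size, threshold, rng):
    """Step 1: sample the column; flag values whose sample share exceeds threshold."""
    sample = rng.sample(values, min(sample_size, len(values)))
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > threshold}


def build_dict(distinct_values):
    """Toy stand-in for a global dictionary: value -> int id."""
    return {v: i for i, v in enumerate(sorted(distinct_values), start=1)}


def bucket_of(value, num_buckets):
    """Deterministic hash bucket (CRC32 is an assumption, not Kylin's hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def scatter_partition(value, skewed, num_buckets, rng):
    """Step 3: skewed values go to a random partition, others to their hash bucket."""
    if value in skewed:
        return rng.randrange(num_buckets)
    return bucket_of(value, num_buckets)


def encode_column(values, num_buckets=4, sample_size=1000, threshold=0.2, seed=0):
    rng = random.Random(seed)
    full_dict = build_dict(set(values))

    # Split the dictionary into hash buckets, one per partition.
    buckets = [{} for _ in range(num_buckets)]
    for v, code in full_dict.items():
        buckets[bucket_of(v, num_buckets)][v] = code

    # Steps 1 and 2: detect skewed values and build their small dictionary.
    skewed = detect_skewed(values, sample_size, threshold, rng)
    skewed_dict = {v: full_dict[v] for v in skewed}

    # Step 3: repartition with the scattering function.
    partitions = [[] for _ in range(num_buckets)]
    for v in values:
        partitions[scatter_partition(v, skewed, num_buckets, rng)].append(v)

    # Step 4: each partition encodes with the skewed dictionary plus its own bucket.
    encoded = []
    for p, rows in enumerate(partitions):
        local = buckets[p]
        for v in rows:
            encoded.append((v, skewed_dict[v] if v in skewed_dict else local[v]))
    return encoded, partitions, skewed


if __name__ == "__main__":
    column = ["hot"] * 900 + [f"k{i}" for i in range(100)]
    encoded, partitions, skewed = encode_column(column)
    print(skewed, [len(p) for p in partitions])
```

With plain hash repartitioning, all 900 "hot" rows would land in a single partition. In the sketch they are spread across partitions, and each partition can still encode them because the skewed dictionary is available everywhere.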