[
https://issues.apache.org/jira/browse/KYLIN-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315299#comment-17315299
]
ASF GitHub Bot commented on KYLIN-4945:
---------------------------------------
zzcclp commented on a change in pull request #1618:
URL: https://github.com/apache/kylin/pull/1618#discussion_r607579269
##########
File path:
kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/builder/CubeTableEncoder.scala
##########
@@ -63,7 +67,20 @@ object CubeTableEncoder extends Logging {
.select(columns: _*)
}
)
+
ds.sparkSession.sparkContext.setJobDescription(null)
+
+ //repartition by a single column during dict encode step before is more easily to cause data skew, add step to void such case.
+ if (!cols.isEmpty && seg.kylinconf.rePartitionEncodedDatasetWithRowKey) {
+ val colsInDS = partitionedDs.schema.map(_.name)
+ val rowKeyColRefs = seg.allRowKeyCols.map(colDesc => convertFromDot(colDesc.identity)).filter(colsInDS.contains).map(col)
+ //if not set in config, use the largest partition num during dict encode step
+ if (seg.kylinconf.getRepartitionNumAfterEncode != 0) {
Review comment:
Change this to `if (seg.kylinconf.getRepartitionNumAfterEncode < 1) {` so that
the job does not fail when this value is set to -1.
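A minimal sketch of the guard described in this comment, assuming that any configured value below 1 (0, -1, and so on) should be treated as "not set" and fall back to the partition count used during the dictionary encode step. The helper `resolvePartitionNum` and its parameter names are hypothetical and are not part of Kylin.

```scala
object RepartitionGuard {
  // Hypothetical helper: chooses the partition count for the post-encode repartition.
  // Assumption: any configured value below 1 means "not set in config".
  def resolvePartitionNum(configured: Int, encodePartitions: Int): Int =
    if (configured < 1) encodePartitions else configured
}
```

Under that assumption, `RepartitionGuard.resolvePartitionNum(-1, 200)` falls back to the encode-step count instead of passing a negative number to the repartition call.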
> Repartition encoded dataset to avoid data skew caused by a single column
> ------------------------------------------------------------------------
>
> Key: KYLIN-4945
> URL: https://issues.apache.org/jira/browse/KYLIN-4945
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Affects Versions: v4.0.0-beta
> Reporter: ShengJun Zheng
> Assignee: ShengJun Zheng
> Priority: Minor
> Fix For: v4.0.0-GA
>
> Attachments: image-2021-03-24-17-37-57-505.png
>
>
> In Kylin 4, the global dictionary is split into several buckets. To encode
> the flat source table more efficiently, the source dataset is repartitioned
> into the same number of partitions as the dictionary has buckets. This
> sometimes has a side effect, because repartitioning by a single column is
> more likely to cause data skew.
> We have a case in which a topN/count_distinct measure has serious data skew.
> The dataset becomes skewed after the repartition, so one task takes the
> majority of the time in the first layer's cuboid build job.
> !image-2021-03-24-17-37-57-505.png!
> To improve this case, we add a step that repartitions the encoded dataset by
> all RowKey columns; the first layer's build time dropped from 20 min to 4 min.
>
>
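The skew effect described in the issue can be illustrated with a small plain-Scala simulation. This is only a sketch under stated assumptions: it uses synthetic rows and a simple hash-modulo assignment rather than Spark's partitioner or Kylin's code, and the names `SkewSketch`, `rows`, and `maxPartitionSize` are hypothetical.

```scala
object SkewSketch {
  // Synthetic rows: (dictionary-encoded column, another row key column).
  // 90% of the rows share the single value "hot" in the first column.
  val rows: Seq[(String, Int)] =
    (0 until 1000).map(i => (if (i % 10 == 0) s"rare-$i" else "hot", i))

  // Size of the largest partition when each row goes to hash(key) mod n.
  def maxPartitionSize[K](n: Int)(key: ((String, Int)) => K): Int =
    rows.groupBy(r => Math.floorMod(key(r).hashCode, n)).values.map(_.size).max

  // Partitioning on the skewed column alone puts every "hot" row in one partition.
  val bySingleColumn: Int = maxPartitionSize(10)(_._1)

  // Partitioning on all columns spreads the "hot" rows across partitions.
  val byAllColumns: Int = maxPartitionSize(10)(r => r)
}
```

Comparing `SkewSketch.bySingleColumn` with `SkewSketch.byAllColumns` shows the largest partition shrinking once the key includes every column, which mirrors the reasoning behind repartitioning by all RowKey columns.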