[
https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288673#comment-15288673
]
Shaofeng SHI commented on KYLIN-1677:
-------------------------------------
Based on KYLIN-1656, I made a couple of small changes:
1) count the number of records in the fact table directly; this will be the
first step of a cube build;
2) distribute the records within the "create flat hive table" step itself (no
need for two separate steps);
3) if the cube has a "shard by" column, distribute by that column's value;
4) otherwise, distribute randomly.
The mandatory column and the partition column weren't picked for the
distribution, as I'm not sure their cardinality is high enough for a good
distribution, whereas that of the "shard by" column usually is.
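
To illustrate rules 3) and 4), here is a minimal sketch of how the distribution clause could be chosen and appended to the flat table statement. It is not the actual Kylin code: the function names, the parameters and the example table/column names are all hypothetical. The only assumption about Hive is that it accepts DISTRIBUTE BY with a column name or with RAND().

```python
from typing import Optional


def distribute_clause(shard_by_column: Optional[str]) -> str:
    """Return the Hive DISTRIBUTE BY clause for the flat table step.

    shard_by_column is a hypothetical parameter: the name of the cube's
    "shard by" column, or None when the cube does not define one.
    """
    if shard_by_column:
        # Rule 3: rows with the same column value go to the same reducer.
        return f"DISTRIBUTE BY {shard_by_column}"
    # Rule 4: no suitable column, so spread the rows randomly.
    return "DISTRIBUTE BY RAND()"


def flat_table_insert(flat_table: str, select_sql: str,
                      shard_by_column: Optional[str] = None) -> str:
    """Build one INSERT statement that both fills and distributes the table."""
    clause = distribute_clause(shard_by_column)
    return f"INSERT OVERWRITE TABLE {flat_table} {select_sql} {clause};"


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(flat_table_insert("flat_tbl", "SELECT * FROM fact_tbl", "SELLER_ID"))
    print(flat_table_insert("flat_tbl", "SELECT * FROM fact_tbl"))
```

Putting the clause on the same statement that populates the flat table is what makes a separate redistribution step unnecessary.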
> Distribute source data by certain columns when creating flat table
> ------------------------------------------------------------------
>
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Shaofeng SHI
> Assignee: Shaofeng SHI
> Fix For: v1.5.3
>
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain
> columns when creating the flat hive table. The data assigned to each mapper
> will then be more similar, so more aggregation can happen on the mapper side
> and less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra high cardinality
> columns, mandatory columns, partition date/time columns, etc.
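
The effect described in the issue can be sketched with a toy simulation. This is only an illustration under simplifying assumptions, not a model of the real job: each row is reduced to a single integer key, a "mapper" is just a bucket, and mapper-side aggregation collapses rows with the same key into one record. The function name and parameters are hypothetical.

```python
import random
from typing import List, Sequence, Set


def records_after_mapper_aggregation(keys: Sequence[int], num_mappers: int,
                                     distribute_by_key: bool,
                                     seed: int = 0) -> int:
    """Count the records left after every mapper aggregates its rows by key."""
    rng = random.Random(seed)
    mappers: List[Set[int]] = [set() for _ in range(num_mappers)]
    for key in keys:
        if distribute_by_key:
            # Same key always lands on the same mapper.
            idx = key % num_mappers
        else:
            # Rows are scattered without regard to their key.
            idx = rng.randrange(num_mappers)
        # A set keeps one entry per key, which mimics the aggregation.
        mappers[idx].add(key)
    # Every remaining entry is a record that still has to be shuffled.
    return sum(len(m) for m in mappers)


if __name__ == "__main__":
    # 100 distinct keys, 50 rows each.
    keys = [k for k in range(100) for _ in range(50)]
    print(records_after_mapper_aggregation(keys, 10, distribute_by_key=True))
    print(records_after_mapper_aggregation(keys, 10, distribute_by_key=False))
```

When rows are distributed by key, the number of records sent to the shuffle equals the number of distinct keys; when they are scattered, the same key is aggregated separately on several mappers and more records remain.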