[ https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319955#comment-15319955 ]
Dayue Gao commented on KYLIN-1677:
----------------------------------
Hi Shaofeng,
Here are the test results when using a Hive view as the fact table:
|| KYLIN-1677 step || Time (min) || KYLIN-1656 step || Time (min) ||
| Count Source Table | 9.02 | Create Intermediate Flat Hive Table | 8.12 |
| Create Intermediate Flat Hive Table | 12.89 | Redistribute Intermediate Flat Hive | 2.39 |
As expected, KYLIN-1677 took more time because it materializes the view twice,
whereas KYLIN-1656 materializes it only once.
To be fair, I also tested a cube that uses a regular (non-view) table as the fact table:
|| KYLIN-1677 step || Time (min) || KYLIN-1656 step || Time (min) ||
| Count Source Table | 1.10 | Create Intermediate Flat Hive Table | 3.74 |
| Create Intermediate Flat Hive Table | 1.70 | Redistribute Intermediate Flat Hive | 5.13 |
In this case, KYLIN-1677 performs better than KYLIN-1656 because it avoids one
round of MapReduce.
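
For a quick side-by-side, the step times in the two tables can be summed with a small Python sketch. The numbers are copied from the tables above; the dictionary layout and the `totals` helper are only for illustration, and the sums cover just the two listed steps per approach, not a whole cube build.

```python
# Step timings in minutes, copied from the two tables above.
# Each list holds the two steps reported for that approach, in table order.
timings = {
    "hive view fact table": {
        "KYLIN-1677": [9.02, 12.89],
        "KYLIN-1656": [8.12, 2.39],
    },
    "non-view fact table": {
        "KYLIN-1677": [1.10, 1.70],
        "KYLIN-1656": [3.74, 5.13],
    },
}


def totals(case):
    """Sum the reported step times for each approach in one test case."""
    return {issue: round(sum(steps), 2) for issue, steps in timings[case].items()}


for case in timings:
    t = totals(case)
    diff = round(t["KYLIN-1677"] - t["KYLIN-1656"], 2)
    print(f"{case}: {t} (KYLIN-1677 minus KYLIN-1656: {diff:+.2f} min)")
```

A positive difference means KYLIN-1677 was slower for those two steps, a negative one means it was faster.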
In general, I'm +1 on releasing KYLIN-1677 as a refinement to KYLIN-1656.
> Distribute source data by certain columns when creating flat table
> ------------------------------------------------------------------
>
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Shaofeng SHI
> Assignee: Shaofeng SHI
> Fix For: v1.5.3
>
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain
> columns when creating the flat Hive table. The data assigned to a mapper
> will then be more similar, so more aggregation can happen on the mapper side
> and less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra-high-cardinality
> columns, mandatory columns, partition date/time columns, etc.
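
The effect described in the issue can be illustrated with a toy Python simulation. This is not Kylin code: the `shuffle_records` helper, the row counts, and the single integer key column are all assumptions made for illustration. Each simulated mapper pre-aggregates its rows by key, and the sketch counts how many records would still have to be shuffled when rows are spread arbitrarily versus when all rows with the same key go to the same mapper.

```python
import random
from collections import defaultdict


def shuffle_records(rows, num_mappers, assign):
    """Count records left after each mapper pre-aggregates its rows by key.

    `assign(i, key)` picks a mapper for row i; one record per distinct key
    per mapper survives mapper-side aggregation.
    """
    keys_per_mapper = defaultdict(set)
    for i, key in enumerate(rows):
        keys_per_mapper[assign(i, key) % num_mappers].add(key)
    return sum(len(keys) for keys in keys_per_mapper.values())


random.seed(0)
# Hypothetical key column: 20000 rows drawn from at most 1000 distinct values.
rows = [random.randrange(1000) for _ in range(20000)]
num_mappers = 8

# Arbitrary placement: rows land on mappers regardless of their key.
arbitrary = shuffle_records(rows, num_mappers, lambda i, key: i)
# Distributed by key: every row with the same key lands on the same mapper.
distributed = shuffle_records(rows, num_mappers, lambda i, key: hash(key))

print("records to shuffle, arbitrary placement:", arbitrary)
print("records to shuffle, distributed by key: ", distributed)
```

When rows are distributed by key, each key is aggregated on exactly one mapper, so the number of records to shuffle drops to the number of distinct keys. A real cube build aggregates over combinations of dimensions, so the actual gain depends on which columns are chosen.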