[
https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297779#comment-15297779
]
Shaofeng SHI commented on KYLIN-1677:
-------------------------------------
Hi Dayue, good question;
1) the RedistributeFlatHiveTableStep does another round of writes after the
FlatHiveTableStep. As we know, a write is much heavier than a read
operation, so I tried to write the data once, already distributed. A
count step is added before the FlatHiveTableStep to estimate the number of
reducers; ideally the count step should be fast, as the network IO is small. I
haven't tested it on a view; do you have some numbers?
2) you're right that for an "inner join" it is not accurate, but it doesn't need
to be accurate; a rough number is good enough. Counting only from the fact table
in this step is just to minimize the cluster effort (a "join" needs extra effort
from Hive).
We can continue evaluating this; if it causes a noticeable performance issue, we
can switch back to the version of KYLIN-1656.
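
To make the idea in 1) concrete, here is a rough sketch (not the actual Kylin
code) of turning the row count from the count step into a reducer number and
then writing the flat table once, already distributed. The function names, the
rows-per-reducer ratio, the cap, and the table and column names are all
assumptions for illustration; only `DISTRIBUTE BY` and the
`mapreduce.job.reduces` property are standard Hive/Hadoop.

```python
import math


def estimate_reducers(row_count, rows_per_reducer=1_000_000, max_reducers=500):
    """Rough reducer count from the fact-table row count.

    Precision is not needed here; the ratio and the cap are hypothetical.
    """
    if row_count <= 0:
        return 1
    return max(1, min(max_reducers, math.ceil(row_count / rows_per_reducer)))


def build_flat_table_hql(flat_table, select_sql, distribute_cols, reducers):
    """Compose one Hive script that writes the flat table a single time,
    with the rows distributed by the chosen columns."""
    cols = ", ".join(distribute_cols)
    return (
        f"SET mapreduce.job.reduces={reducers};\n"
        f"INSERT OVERWRITE TABLE {flat_table}\n"
        f"{select_sql}\n"
        f"DISTRIBUTE BY {cols};"
    )


# Hypothetical usage: the count step returned about 25.3 million rows.
reducers = estimate_reducers(25_300_000)
print(build_flat_table_hql(
    "kylin_flat_table",                      # hypothetical table name
    "SELECT user_id, dt, price FROM fact",   # hypothetical flat-table query
    ["user_id", "dt"],
    reducers,
))
```

Because the count only sizes the job, being off by the inner-join filter
changes the reducer number slightly but not the correctness of the output.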
> Distribute source data by certain columns when creating flat table
> ------------------------------------------------------------------
>
> Key: KYLIN-1677
> URL: https://issues.apache.org/jira/browse/KYLIN-1677
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Shaofeng SHI
> Assignee: Shaofeng SHI
> Fix For: v1.5.3
>
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain
> columns when creating the flat Hive table. The data assigned to a mapper
> will then have more similarity, so more aggregation can happen on the mapper
> side, and less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra high cardinality
> columns, mandatory columns, partition date/time columns, etc.
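
The effect described in the issue can be illustrated with a small simulation.
This is only a toy model under stated assumptions: each mapper emits one record
per distinct key it has seen (a per-mapper combine), and the key counts, row
counts, and function name are made up for the example.

```python
import random


def records_after_combine(rows, num_mappers, assign):
    """Count records leaving the map side after a per-mapper combine.

    Each mapper emits one record per distinct key it saw, so the total is
    the number of distinct (mapper, key) pairs.
    """
    seen = set()
    for position, key in enumerate(rows):
        mapper = assign(position, key) % num_mappers
        seen.add((mapper, key))
    return len(seen)


rng = random.Random(0)
rows = [rng.randrange(1000) for _ in range(100_000)]  # keys drawn from 0..999
mappers = 10

# Rows land on mappers by position, regardless of their key.
arbitrary = records_after_combine(rows, mappers, lambda pos, key: pos)
# Rows are distributed by the key column, so equal keys share a mapper.
by_key = records_after_combine(rows, mappers, lambda pos, key: key)

print("shuffled records, arbitrary placement:", arbitrary)
print("shuffled records, distributed by key: ", by_key)
```

When rows are distributed by the key, every key is combined on exactly one
mapper, so the number of shuffled records equals the number of distinct keys;
with arbitrary placement the same key is emitted by many mappers.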
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)