[ 
https://issues.apache.org/jira/browse/KYLIN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297779#comment-15297779
 ] 

Shaofeng SHI commented on KYLIN-1677:
-------------------------------------

Hi Dayue, good question.

1) The RedistributeFlatHiveTableStep does another round of writes after the 
FlatHiveTableStep. As we know, a write is much heavier than a read 
operation, so I tried to write the data only once, with the distribution 
applied. A count step is added before the FlatHiveTableStep to estimate the 
number of reducers; ideally the count step should be fast because the network 
IO is small. I have not tested this on a view; do you have some numbers? 

2) You're right that for an "inner join" the count is not accurate, but it 
doesn't need to be accurate; a rough number is good enough. Counting only from 
the fact table in this step is just to minimize the cluster effort (a "join" 
needs extra effort from Hive).

We can continue evaluating this; if it causes a noticeable performance issue, 
we can switch back to the version of KYLIN-1656.
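
To make the idea concrete, here is a rough sketch (not the actual Kylin code) 
of how a fact-table row count could be turned into a reducer number and then 
used in a single distributed write. The function names, the 
rows-per-reducer threshold, the reducer cap, and the table/column names are 
all assumptions made for illustration; only DISTRIBUTE BY and the 
mapreduce.job.reduces property are standard Hive/Hadoop features.

```python
import math


def estimate_reducers(row_count, rows_per_reducer=1_000_000, max_reducers=500):
    """Rough reducer count from a fact-table row count.

    The threshold and cap are hypothetical defaults; the estimate only
    needs to be approximate, so counting the fact table alone is enough.
    """
    if row_count <= 0:
        return 1
    return max(1, min(max_reducers, math.ceil(row_count / rows_per_reducer)))


def build_flat_table_sql(flat_table, select_sql, distribute_cols, reducers):
    """Build one HiveQL script that writes the flat table once,
    already distributed by the given columns."""
    cols = ", ".join(distribute_cols)
    return (
        f"SET mapreduce.job.reduces={reducers};\n"
        f"INSERT OVERWRITE TABLE {flat_table}\n"
        f"{select_sql}\n"
        f"DISTRIBUTE BY {cols};"
    )


if __name__ == "__main__":
    # Hypothetical numbers and names, for illustration only.
    reducers = estimate_reducers(2_500_000)
    print(build_flat_table_sql(
        "kylin_flat_table",
        "SELECT f.seller_id, f.part_dt, f.price FROM fact_table f",
        ["seller_id", "part_dt"],
        reducers,
    ))
```

The point of the sketch is the ordering: one cheap count first, then a 
single write that carries the distribution, instead of a write followed by 
a second redistributing write.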

> Distribute source data by certain columns when creating flat table
> ------------------------------------------------------------------
>
>                 Key: KYLIN-1677
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1677
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>            Reporter: Shaofeng SHI
>            Assignee: Shaofeng SHI
>             Fix For: v1.5.3
>
>
> Inspired by KYLIN-1656, Kylin can distribute the source data by certain 
> columns when creating the flat hive table. Then the data assigned to a mapper 
> will have more similarity, more aggregation can happen on the mapper side, and 
> less shuffle and reduce work is needed.
> Columns that can be used for the distribution include: ultra high cardinality 
> columns, mandatory columns, partition date/time columns, etc.


