[ 
https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nichunen resolved KYLIN-3925.
-----------------------------
       Resolution: Fixed
    Fix Version/s:     (was: v3.0.0)
                   v3.0.0-alpha2

> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to 
> avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3925
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3925
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>             Fix For: v3.0.0-alpha2
>
>
> Previously, when doing cube optimization, there were two map-only MR jobs: 
> *FilterRecommendCuboidDataJob* and *UpdateOldCuboidShardJob*. The benefit of 
> a map-only job is that it avoids shuffling. However, this benefit comes at 
> the cost of a more severe issue: too many small HDFS files.
> Suppose there are 10 HDFS files of current cuboid data, each 500 MB. If the 
> block size is 100 MB, the map-only job *FilterRecommendCuboidDataJob* will 
> run 10 * (500/100) = 50 mappers, and each mapper will write one HDFS file, 
> for a total of 50 files. Since *FilterRecommendCuboidDataJob* filters out 
> the cuboid data to be kept for future use, each output file will be smaller 
> than 100 MB, and in some cases even smaller than 50 MB.
> To avoid this kind of small-file issue, it is better to add a reduce step 
> so that the number of final output HDFS files can be controlled.
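The file-count arithmetic above can be sketched as follows. This is a minimal illustration, not Kylin code: the helper names are hypothetical, and it only models the counting argument (one mapper per HDFS block in a map-only job, versus one output file per reducer once a reduce step is added).

```python
def mapper_count(num_files, file_size_mb, block_size_mb):
    # A map-only MR job launches one mapper per HDFS block, and each
    # mapper writes its own output file.
    return num_files * (file_size_mb // block_size_mb)

def output_files_with_reduce_step(num_reducers):
    # With a reduce step, each reducer writes exactly one output file,
    # so the configured reducer count caps the number of files.
    return num_reducers

# Numbers from the example in the issue: 10 files of 500 MB, 100 MB blocks.
mappers = mapper_count(num_files=10, file_size_mb=500, block_size_mb=100)
print(mappers)  # 50 output files from the map-only job, each < 100 MB

# Adding a reduce step with, say, 10 reducers caps the output at 10 files.
print(output_files_with_reduce_step(10))  # 10
```

Since the filtering step discards some cuboid data, each of the 50 map outputs holds less than one block of data, which is exactly the small-file problem the reduce step is meant to fix.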



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
