[
https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814418#comment-16814418
]
ASF subversion and git services commented on KYLIN-3925:
--------------------------------------------------------
Commit 5316e190acd85f52205b0849a0d8689004900c1b in kylin's branch
refs/heads/master from kyotoYaho
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=5316e19 ]
KYLIN-3925 Add reduce step for FilterRecommendCuboidDataJob &
UpdateOldCuboidShardJob to avoid generating small hdfs files
> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to
> avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-3925
> URL: https://issues.apache.org/jira/browse/KYLIN-3925
> Project: Kylin
> Issue Type: Improvement
> Reporter: Zhong Yanghong
> Assignee: Zhong Yanghong
> Priority: Major
> Fix For: v3.0.0
>
>
> Previously, cube optimization ran two map-only MR jobs:
> *FilterRecommendCuboidDataJob* & *UpdateOldCuboidShardJob*. A map-only job
> has the benefit of avoiding shuffling; however, that benefit comes at the
> cost of a more severe issue: too many small HDFS files.
> Suppose the current cuboid data consists of 10 HDFS files of 500 MB each.
> With a block size of 100 MB, the map-only job *FilterRecommendCuboidDataJob*
> gets 10 * (500/100) = 50 mappers, and each mapper writes its own HDFS file,
> yielding 50 files in the end. Since *FilterRecommendCuboidDataJob* retains
> only the cuboid data that will be used in the future, each output file will
> be smaller than 100 MB, and in some cases even smaller than 50 MB.
> To avoid this kind of small-file issue, it is better to add a reduce step
> that controls the number of final output HDFS files.
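The sizing logic behind such a reduce step can be sketched as follows. This is not Kylin's actual implementation; `reducerCount` and `targetFileMB` are hypothetical names for illustration, showing only the arithmetic: pick the number of reduce tasks from the total input size and a target per-file size, so each reducer emits roughly one well-sized HDFS file instead of 50 mapper-sized fragments.

```java
// Sketch only: derive a reduce-task count so each reducer writes one
// reasonably sized HDFS file. In a real MR job this value would be passed
// to Job.setNumReduceTasks().
public class ReducerCountSketch {

    // targetFileMB is a hypothetical tuning knob, not a Kylin setting.
    static int reducerCount(long totalInputMB, long targetFileMB) {
        // Ceiling division, with a floor of 1 reducer.
        return (int) Math.max(1, (totalInputMB + targetFileMB - 1) / targetFileMB);
    }

    public static void main(String[] args) {
        // Example from the issue: 10 input files of 500 MB = 5000 MB total.
        // With a 500 MB target, 10 reducers yield 10 files instead of the
        // 50 small files the 50 mappers would produce.
        System.out.println(reducerCount(10 * 500, 500)); // 10
    }
}
```

With this scheme the output file count tracks the data volume rather than the mapper count, which is what decoupling output size from the input block split achieves.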
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)