[
https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814418#comment-16814418
]
ASF subversion and git services commented on KYLIN-3925:
--------------------------------------------------------
Commit 5316e190acd85f52205b0849a0d8689004900c1b in kylin's branch
refs/heads/master from kyotoYaho
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=5316e19 ]
KYLIN-3925 Add reduce step for FilterRecommendCuboidDataJob &
UpdateOldCuboidShardJob to avoid generating small hdfs files
> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to
> avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-3925
> URL: https://issues.apache.org/jira/browse/KYLIN-3925
> Project: Kylin
> Issue Type: Improvement
> Reporter: Zhong Yanghong
> Assignee: Zhong Yanghong
> Priority: Major
> Fix For: v3.0.0
>
>
> Previously, cube optimization ran two map-only MR jobs:
> *FilterRecommendCuboidDataJob* & *UpdateOldCuboidShardJob*. A map-only job
> has the benefit of avoiding shuffling; however, that benefit comes at the
> cost of a more severe issue: too many small HDFS files.
> Suppose the current cuboid data consists of 10 HDFS files of 500 MB each.
> With a block size of 100 MB, the map-only job *FilterRecommendCuboidDataJob*
> gets 10 * (500/100) = 50 mappers, and each mapper writes its own HDFS file,
> yielding 50 files in the end. Since *FilterRecommendCuboidDataJob* retains
> only the cuboid data that will be used in the future, each output file will
> be smaller than 100 MB, and in some cases even smaller than 50 MB.
> To avoid this kind of small-file issue, it is better to add a reduce step
> that controls the number of final output HDFS files.
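The sizing logic behind such a reduce step can be sketched as follows. This is not Kylin's actual implementation; `reducerCount` and `targetFileMB` are hypothetical names for illustration, showing only the arithmetic: pick the number of reduce tasks from the total input size and a target per-file size, so each reducer emits roughly one well-sized HDFS file instead of 50 mapper-sized fragments.

```java
// Sketch only: derive a reduce-task count so each reducer writes one
// reasonably sized HDFS file. In a real MR job this value would be passed
// to Job.setNumReduceTasks().
public class ReducerCountSketch {

    // targetFileMB is a hypothetical tuning knob, not a Kylin setting.
    static int reducerCount(long totalInputMB, long targetFileMB) {
        // Ceiling division, with a floor of 1 reducer.
        return (int) Math.max(1, (totalInputMB + targetFileMB - 1) / targetFileMB);
    }

    public static void main(String[] args) {
        // Example from the issue: 10 input files of 500 MB = 5000 MB total.
        // With a 500 MB target, 10 reducers yield 10 files instead of the
        // 50 small files the 50 mappers would produce.
        System.out.println(reducerCount(10 * 500, 500)); // 10
    }
}
```

With this scheme the output file count tracks the data volume rather than the mapper count, which is what decoupling output size from the input block split achieves.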
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)