[ https://issues.apache.org/jira/browse/HIVE-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281255#comment-16281255 ]
Hengyu Dai commented on HIVE-18234:
-----------------------------------

I have applied the patch from [https://issues.apache.org/jira/browse/HIVE-15178] to Hive 2.1.1 and it works! Thanks [~aihuaxu]!

> Hive MergeFileTask doesn't work correctly
> -----------------------------------------
>
>                 Key: HIVE-18234
>                 URL: https://issues.apache.org/jira/browse/HIVE-18234
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.1.1
>            Reporter: Hengyu Dai
>            Assignee: Aihua Xu
>
> For MergeFileTask, Hive reads the properties hive.merge.mapfiles,
> hive.merge.mapredfiles, hive.merge.size.per.task, and
> hive.merge.smallfiles.avgsize to determine whether to generate a
> MergeFileTask to merge small files. If a merge is needed, Hive generates
> a MergeFileTask/MapWork to merge the files, and the property values are
> finally stored in MapWork#maxSplitSize, MapWork#minSplitSize,
> MapWork#minSplitSizePerNode, and MapWork#minSplitSizePerRack.
> But Hive doesn't use these settings when it submits the map task to Hadoop,
> i.e., the corresponding Hadoop settings "mapred.max.split.size",
> "mapred.min.split.size.per.node", and "mapred.min.split.size.per.rack" are
> not set from these Hive settings. So those Hive settings do not take effect
> for MergeFileTask.
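As a rough illustration of the flow described above, here is a minimal, self-contained Java sketch. It is not Hive code: `MergeSplitSettingsSketch`, `SplitSizes`, `needsMerge`, and `toHadoopSettings` are hypothetical names, the average-size check is an assumed simplification of how hive.merge.smallfiles.avgsize is applied, and reusing hive.merge.size.per.task for all three split sizes is likewise an assumption. Only the three Hadoop keys are taken from the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeSplitSettingsSketch {

    /** Hypothetical stand-in for the split-size fields the report says are stored on MapWork. */
    static final class SplitSizes {
        final long maxSplitSize;
        final long minSplitSizePerNode;
        final long minSplitSizePerRack;

        SplitSizes(long maxSplitSize, long minSplitSizePerNode, long minSplitSizePerRack) {
            this.maxSplitSize = maxSplitSize;
            this.minSplitSizePerNode = minSplitSizePerNode;
            this.minSplitSizePerRack = minSplitSizePerRack;
        }
    }

    /** Assumed rule: merge when the average file size is below hive.merge.smallfiles.avgsize. */
    static boolean needsMerge(long totalBytes, int fileCount, long smallFilesAvgSize) {
        return fileCount > 0 && totalBytes / fileCount < smallFilesAvgSize;
    }

    /** Copies the split sizes into the Hadoop keys named in the report. */
    static Map<String, Long> toHadoopSettings(SplitSizes sizes) {
        Map<String, Long> conf = new LinkedHashMap<>();
        conf.put("mapred.max.split.size", sizes.maxSplitSize);
        conf.put("mapred.min.split.size.per.node", sizes.minSplitSizePerNode);
        conf.put("mapred.min.split.size.per.rack", sizes.minSplitSizePerRack);
        return conf;
    }

    public static void main(String[] args) {
        long smallFilesAvgSize = 500_000_000L; // hive.merge.smallfiles.avgsize in the repro
        long sizePerTask = 1_000_000_000L;     // hive.merge.size.per.task in the repro

        // Hypothetical job output: 100 files of 20 MB each.
        boolean merge = needsMerge(100L * 20_000_000L, 100, smallFilesAvgSize);
        System.out.println("merge needed: " + merge);

        if (merge) {
            // Assumption: every split size is set to the per-task target size.
            SplitSizes sizes = new SplitSizes(sizePerTask, sizePerTask, sizePerTask);
            toHadoopSettings(sizes).forEach((key, value) -> System.out.println(key + "=" + value));
        }
    }
}
```

If the step modelled by `toHadoopSettings` is skipped, the three Hadoop keys are never populated from the Hive values, which matches the symptom reported here.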
>
> Steps to reproduce:
> The following SQL still produces many small files (less than 20MB each):
> {code:sql}
> set hive.merge.mapredfiles=true;
> set hive.merge.mapfiles=true;
> set hive.merge.smallfiles.avgsize=500000000;
> set hive.merge.size.per.task=1000000000;
> insert overwrite table foo partition(dt='20171203')
> select * from bar;
> {code}
> To fix this problem, I think we should pass these properties to Hadoop in
> MergeFileTask. The following code works for me:
> {code:java}
> // in MergeFileTask#execute()
> job.setInputFormat(work.getInputformatClass());
> job.setOutputFormat(HiveOutputFormatImpl.class);
> job.setMapperClass(MergeFileMapper.class);
> job.setMapOutputKeyClass(NullWritable.class);
> job.setMapOutputValueClass(NullWritable.class);
> job.setOutputKeyClass(NullWritable.class);
> job.setOutputValueClass(NullWritable.class);
> job.setNumReduceTasks(0);
> // set these properties so the split sizes reach Hadoop
> job.setLong("mapred.max.split.size", work.getMaxSplitSize());
> job.setLong("mapred.min.split.size.per.rack", work.getMinSplitSizePerRack());
> job.setLong("mapred.min.split.size.per.node", work.getMinSplitSizePerNode());
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)