[ 
https://issues.apache.org/jira/browse/HIVE-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281255#comment-16281255
 ] 

Hengyu Dai commented on HIVE-18234:
-----------------------------------

I applied the patch from [https://issues.apache.org/jira/browse/HIVE-15178] to 
Hive 2.1.1 and it works! Thanks, [~aihuaxu]!

> Hive MergeFileTask doesn't work correctly
> -----------------------------------------
>
>                 Key: HIVE-18234
>                 URL: https://issues.apache.org/jira/browse/HIVE-18234
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.1.1
>            Reporter: Hengyu Dai
>            Assignee: Aihua Xu
>
> For MergeFileTask, Hive reads the properties hive.merge.mapfiles, 
> hive.merge.mapredfiles, hive.merge.size.per.task and 
> hive.merge.smallfiles.avgsize to decide whether small files need to be 
> merged. If a merge is needed, Hive generates a MergeFileTask/MapWork to 
> merge the files, and the resulting values are finally stored in 
> MapWork#maxSplitSize, MapWork#minSplitSize, MapWork#minSplitSizePerNode 
> and MapWork#minSplitSizePerRack.
> However, Hive does not use these values when it submits the map task to 
> Hadoop, i.e., the corresponding Hadoop settings "mapred.max.split.size", 
> "mapred.min.split.size.per.node" and "mapred.min.split.size.per.rack" are 
> never set from them, so these Hive settings have no effect for 
> MergeFileTask.
> Steps to reproduce:
> The following SQL still produces many small files (less than 20 MB):
> {code:sql}
> set hive.merge.mapredfiles=true;
> set hive.merge.mapfiles=true;
> set hive.merge.smallfiles.avgsize=500000000;
> set hive.merge.size.per.task=1000000000;
> insert overwrite table foo partition(dt='20171203')
> select * from bar;
> {code}
> To fix this problem, I think we should pass these properties to Hadoop in 
> MergeFileTask.
> The following code works for me:
> {code:java}
>       // in MergeFileTask#execute()
>       job.setInputFormat(work.getInputformatClass());
>       job.setOutputFormat(HiveOutputFormatImpl.class);
>       job.setMapperClass(MergeFileMapper.class);
>       job.setMapOutputKeyClass(NullWritable.class);
>       job.setMapOutputValueClass(NullWritable.class);
>       job.setOutputKeyClass(NullWritable.class);
>       job.setOutputValueClass(NullWritable.class);
>       job.setNumReduceTasks(0);
>       // set these properties
>       job.setLong("mapred.max.split.size", work.getMaxSplitSize());
>       job.setLong("mapred.min.split.size.per.rack", work.getMinSplitSizePerRack());
>       job.setLong("mapred.min.split.size.per.node", work.getMinSplitSizePerNode());
> {code}
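
For readers without a Hive build at hand, the mapping proposed in the description can be sketched without any Hadoop dependency. This is only an illustration, not Hive's actual code: the class MergeSplitSettingsSketch, the SplitSizes holder and the plain Map standing in for the job configuration are all hypothetical, and the numbers in main() are arbitrary. Only the four Hadoop property names come from Hadoop itself. The null check assumes a split size may be unset, which the snippet in the description does not handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, dependency-free sketch; not Hive's actual MergeFileTask code.
class MergeSplitSettingsSketch {

    /** Hypothetical stand-in for the split-size fields that MapWork carries. */
    static final class SplitSizes {
        final Long maxSplitSize;
        final Long minSplitSize;
        final Long minSplitSizePerNode;
        final Long minSplitSizePerRack;

        SplitSizes(Long maxSplitSize, Long minSplitSize,
                   Long minSplitSizePerNode, Long minSplitSizePerRack) {
            this.maxSplitSize = maxSplitSize;
            this.minSplitSize = minSplitSize;
            this.minSplitSizePerNode = minSplitSizePerNode;
            this.minSplitSizePerRack = minSplitSizePerRack;
        }
    }

    /** Copies every split size that is set into the matching Hadoop property. */
    static Map<String, String> toHadoopProperties(SplitSizes sizes) {
        // The Map is a stand-in for the job configuration (assumption).
        Map<String, String> props = new LinkedHashMap<>();
        putIfSet(props, "mapred.max.split.size", sizes.maxSplitSize);
        putIfSet(props, "mapred.min.split.size", sizes.minSplitSize);
        putIfSet(props, "mapred.min.split.size.per.node", sizes.minSplitSizePerNode);
        putIfSet(props, "mapred.min.split.size.per.rack", sizes.minSplitSizePerRack);
        return props;
    }

    /** Skips unset values so that an absent size never overwrites a default. */
    private static void putIfSet(Map<String, String> props, String key, Long value) {
        if (value != null) {
            props.put(key, Long.toString(value));
        }
    }

    public static void main(String[] args) {
        // Arbitrary illustrative values; minSplitSize is deliberately left unset.
        SplitSizes sizes = new SplitSizes(1000000000L, null, 500000000L, 500000000L);
        toHadoopProperties(sizes).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real patch the same loop body would call job.setLong(...) on the job configuration instead of filling a Map.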



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
