Ning is currently out on vacation; I think he'll be back to working on this when he returns.
JVS

________________________________________
From: Viraj Bhat [[email protected]]
Sent: Thursday, July 01, 2010 11:40 PM
To: [email protected]
Subject: RE: merging the size of the reduce output

Okay, I read that this is a work in progress: https://issues.apache.org/jira/browse/HIVE-1307 is meant to deal with small files when doing dynamic partitioning. There was a suggestion to try:

    hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

for Hadoop 20 when running queries on this partition.

Viraj

________________________________
From: Viraj Bhat [mailto:[email protected]]
Sent: Thursday, July 01, 2010 11:31 PM
To: [email protected]
Cc: [email protected]
Subject: RE: merging the size of the reduce output

Hi Yongqiang,

I am facing a similar situation. I am using the latest trunk of Hive with dynamic partitioning, and it is a map-only job which converts files from compressed TXT gz to RC format. The DDL of the task looks similar to:

    FROM gztable
    INSERT OVERWRITE TABLE rctable … PARTITION(datestamp, partitionlevel1, partitionlevel1)
    SELECT … ..

    set hive.merge.mapredfiles=true;
    set hive.merge.mapfiles=true;
    set hive.merge.smallfiles.avgsize=256000000;
    set hive.merge.size.smallfiles.avgsize=256000000;

When I run a job, I see that the following are set to false in the job.xml when the job starts up:

    hive.merge.mapfiles = false;
    hive.merge.mapredfiles = false;

Is this a bug with dynamic partitioning? Is there something else I need to set to get this to work and remove the small files I might be generating?

Viraj

________________________________
From: Yongqiang He [mailto:[email protected]]
Sent: Sunday, June 13, 2010 10:56 PM
To: [email protected]
Subject: Re: merging the size of the reduce output

I think there is another parameter, “hive.merge.smallfiles.avgsize”, which decides whether to run the merge job based on the average size of the output files. The default for that parameter is 16M, so if the average output size is larger than 16M, it will not merge.
Maybe you can try increasing that value to see.

Thanks
Yongqiang

On 6/13/10 10:41 PM, "Sammy Yu" <[email protected]> wrote:

Hi,
I have both hive.merge.mapfiles and hive.merge.mapredfiles set to true via the shell tool and the hive-default.xml configuration file. However, it appears the job configuration is somehow changed before the job is submitted. Is there another condition that can cause this to happen?

Thanks,
Sammy

On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu <[email protected]> wrote:

Looking at ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java, hive.merge.mapredfiles is effective if there is a reducer for your job. Otherwise you should set hive.merge.mapfiles to true.

On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu <[email protected]> wrote:

Hi,
I'm running the latest version of trunk, r953172. I'm doing a dynamic partition insert overwrite query which generates a lot of small files in each of the partitions. I was hoping this could be solved by setting hive.merge.mapredfiles to true. However, it seems that whenever the job is submitted it is always set to false, so it doesn't have any effect. I also tried to modify this property in hive-default.xml, but that didn't work either.

Thanks,
Sammy
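
As a rough illustration of the average-size check Yongqiang describes for hive.merge.smallfiles.avgsize, here is a minimal Python sketch. It is not Hive's actual code: the function name `should_merge` and its arguments are hypothetical, the 16,000,000-byte default stands in for the "16M" mentioned in the thread, and treating an average exactly at the threshold as "no merge" is an assumption.

```python
from typing import Sequence


def should_merge(
    file_sizes: Sequence[int],
    avg_size_threshold: int = 16_000_000,  # assumed byte value for the "16M" default
    merge_enabled: bool = True,  # stands in for hive.merge.mapfiles / hive.merge.mapredfiles
) -> bool:
    """Return True if a follow-up merge job would be warranted.

    Hypothetical helper: merges only when merging is enabled and the
    average output file size is below the threshold.
    """
    if not merge_enabled or not file_sizes:
        return False
    average = sum(file_sizes) / len(file_sizes)
    # Assumption: an average exactly equal to the threshold does not merge.
    return average < avg_size_threshold
```

Under these assumptions, raising the threshold to the 256000000 used in Viraj's settings means that two output files of 100,000,000 and 200,000,000 bytes (average 150,000,000) would qualify for a merge, while with the default they would not. If the merge flags end up false in job.xml, as reported in the thread, the threshold never comes into play.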
