I'm sure Ning will appreciate any help you can give, so if you make progress, feel free to upload an updated patch.
JVS

On Jul 2, 2010, at 4:44 PM, Viraj Bhat wrote:

> Hi John,
> Thanks again for letting me know. This can be overcome by using
> CombineInputFormat; unfortunately I am not using that branch ;)
> Also, a large number of small files for some partitions causes poor
> utilization of the Namenode.
> Please let me know if you need help with the patch.
> Thanks
> Viraj
>
> -----Original Message-----
> From: John Sichi [mailto:[email protected]]
> Sent: Thursday, July 01, 2010 11:57 PM
> To: [email protected]
> Subject: RE: merging the size of the reduce output
>
> Ning is currently out on vacation; I think he'll be back to working on
> this when he returns.
>
> JVS
>
> ________________________________________
> From: Viraj Bhat [[email protected]]
> Sent: Thursday, July 01, 2010 11:40 PM
> To: [email protected]
> Subject: RE: merging the size of the reduce output
>
> Okay, I read that there is work in progress
> (https://issues.apache.org/jira/browse/HIVE-1307) to deal with small
> files when doing dynamic partitioning.
> There was a suggestion to try:
> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> for Hadoop 20 when running queries on this partition.
> Viraj
>
> ________________________________
> From: Viraj Bhat [mailto:[email protected]]
> Sent: Thursday, July 01, 2010 11:31 PM
> To: [email protected]
> Cc: [email protected]
> Subject: RE: merging the size of the reduce output
>
> Hi Yongqiang,
> I am facing a similar situation on the latest trunk of Hive. I am
> using dynamic partitioning in a map-only job, which converts files
> from compressed TXT (gz) to RC format.
> The DDL of the task looks similar to:
>
> FROM gztable
> INSERT OVERWRITE TABLE rctable
> PARTITION(datestamp, partitionlevel1, partitionlevel2)
> SELECT ...
> set hive.merge.mapredfiles=true;
> set hive.merge.mapfiles=true;
> set hive.merge.smallfiles.avgsize=256000000;
> set hive.merge.size.smallfiles.avgsize=256000000;
>
> When I run a job, I see that the following are set to false in the
> job.xml when the job starts up:
> hive.merge.mapfiles = false;
> hive.merge.mapredfiles = false;
>
> Is this a bug with dynamic partitioning? Is there something else I need
> to set to get this to work and remove the small files I might be
> generating?
>
> Viraj
>
> ________________________________
> From: Yongqiang He [mailto:[email protected]]
> Sent: Sunday, June 13, 2010 10:56 PM
> To: [email protected]
> Subject: Re: merging the size of the reduce output
>
> There is another parameter, "hive.merge.smallfiles.avgsize", which
> decides whether to run the merge job based on the average size of the
> output files. The default for that parameter is 16M, so if the average
> output size is larger than 16M, the merge will not run.
> Maybe you can try increasing that value.
>
> Thanks
> Yongqiang
>
> On 6/13/10 10:41 PM, "Sammy Yu" <[email protected]> wrote:
>
> Hi,
> I have both hive.merge.mapfiles and hive.merge.mapredfiles set to
> true via the shell tool and the hive-default.xml configuration file.
> However, it appears the job configuration is somehow changed before
> the job is submitted. Is there another condition that can cause this
> to happen?
>
> Thanks,
> Sammy
>
> On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu <[email protected]> wrote:
>
> Looking at
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java,
> hive.merge.mapredfiles is effective only if there is a reducer for
> your job. Otherwise you should set hive.merge.mapfiles to true.
>
> On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu <[email protected]> wrote:
>
> Hi,
> I'm running the latest version of trunk, r953172. I'm doing a
> dynamic partition insert overwrite query which generates a lot of
> small files in each of the partitions.
> I was hoping this could be solved by setting hive.merge.mapredfiles
> to true. However, whenever the job is submitted it is always set to
> false, so it doesn't seem to have any effect. I also tried modifying
> this property in hive-default.xml, but that didn't work either.
>
> Thanks,
> Sammy
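[Editor's note: for readers following the thread, the kind of conversion statement Viraj describes (a map-only dynamic-partition insert from a gzipped-text table into an RCFile table) would look roughly like the sketch below. The table and column names (gztable, rctable, col1, col2) are placeholders taken from or invented around the thread, not a real schema.]

```sql
-- Sketch only: table and column names are illustrative placeholders.
-- Dynamic partitioning must be enabled; nonstrict mode lets every
-- partition column be determined dynamically from the SELECT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM gztable
INSERT OVERWRITE TABLE rctable
PARTITION (datestamp, partitionlevel1, partitionlevel2)
SELECT col1, col2, datestamp, partitionlevel1, partitionlevel2;
```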
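[Editor's note: pulling together the suggestions that appear in the thread, a session attacking the small-file problem from both ends might set the following. The property names and the 256000000 value are taken from the messages above; whether these take effect for dynamic-partition inserts is exactly what HIVE-1307 was addressing at the time.]

```sql
-- Merge small output files of map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Per Yongqiang: the merge job only runs when the average output file
-- size is below this threshold (default ~16M), so raise it.
SET hive.merge.smallfiles.avgsize=256000000;
-- Alternative suggested for Hadoop 20: combine small splits on the read side.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```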
