Re: merging the size of the reduce output

John Sichi Tue, 06 Jul 2010 18:28:24 -0700

I'm sure Ning will appreciate any help you can give, so if you make progress, 
feel free to upload an updated patch.


JVS

On Jul 2, 2010, at 4:44 PM, Viraj Bhat wrote:

> Hi John,
> Thanks again for letting me know. This can be overcome, though, by using
> the CombineInputFormat; unfortunately I am not using that branch ;)
> Also, a large number of small files in some partitions causes poor
> utilization of the Namenode.
> Please let me know if you need help with the patch.
> Thanks
> Viraj
> 
> -----Original Message-----
> From: John Sichi [mailto:[email protected]] 
> Sent: Thursday, July 01, 2010 11:57 PM
> To: [email protected]
> Subject: RE: merging the size of the reduce output
> 
> Ning is currently out on vacation; I think he'll be back to working on
> this when he returns.
> 
> JVS
> 
> ________________________________________
> From: Viraj Bhat [[email protected]]
> Sent: Thursday, July 01, 2010 11:40 PM
> To: [email protected]
> Subject: RE: merging the size of the reduce output
> 
> Okay I read that this is a work in progress
> https://issues.apache.org/jira/browse/HIVE-1307 to deal with small files
> when doing dynamic partitioning.
> There was a suggestion to try:
> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> for Hadoop 20 when running queries on this partition.
> Viraj
> 
> ________________________________
> From: Viraj Bhat [mailto:[email protected]]
> Sent: Thursday, July 01, 2010 11:31 PM
> To: [email protected]
> Cc: [email protected]
> Subject: RE: merging the size of the reduce output
> 
> Hi Yongqiang,
> I am facing a similar situation. I am using the latest trunk of Hive
> with dynamic partitioning, and it is a map-only job that
> converts files from gzip-compressed text to RC format.
> The DDL of the task looks similar to:
> 
> FROM gztable
> 
> INSERT OVERWRITE TABLE  rctable
> 
> ...
> PARTITION(datestamp, partitionlevel1, partitionlevel1)
> 
> 
> SELECT ...
> 
> 
> ..
> set hive.merge.mapredfiles=true;
> set hive.merge.mapfiles=true;
> set hive.merge.smallfiles.avgsize=256000000;
> set hive.merge.size.smallfiles.avgsize=256000000;
> 
> When I run a job, I see that the following are set to false in the
> job.xml when the job starts up.
> hive.merge.mapfiles = false;
> hive.merge.mapredfiles = false;
> 
> Is this a bug with dynamic partitioning? Is there something else I need
> to set to get this to work and remove the small files I might be generating?
> 
> Viraj
> 
> ________________________________
> From: Yongqiang He [mailto:[email protected]]
> Sent: Sunday, June 13, 2010 10:56 PM
> To: [email protected]
> Subject: Re: merging the size of the reduce output
> 
> I think there is another parameter, "hive.merge.smallfiles.avgsize", that
> decides whether to run the merge job based on the average size of the
> output files. The default for that parameter is 16M, so if the average
> output size is larger than 16M, the merge will not run.
> Maybe you can try increasing that value and see.
> 
> Thanks
> Yongqiang
> On 6/13/10 10:41 PM, "Sammy Yu" <[email protected]> wrote:
> Hi,
>   I have both hive.merge.mapfiles and hive.merge.mapredfiles set to
> true via the shell tool and the hive-default.xml configuration file.
> However, it appears somehow the job configuration is changed before the
> job is submitted.  Is there another condition that can cause this to
> happen?
> 
> Thanks,
> Sammy
> 
> 
> On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu <[email protected]> wrote:
> Looking at
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java,
> hive.merge.mapredfiles is effective if there is a reducer for your job.
> Otherwise you should have set hive.merge.mapfiles to true.
> 
> 
> On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu <[email protected]> wrote:
> Hi,
>   I'm running the latest version of trunk, r953172.  I'm doing a
> dynamic partition insert overwrite query which generates a lot of small
> files in each of the partitions.  I was hoping this could be solved by
> setting hive.merge.mapredfiles to true.  However, it seems that whenever
> the job is submitted it is always set to false, so it doesn't seem to
> have any effect.  I also tried to modify this property in
> hive-default.xml, but that didn't work either.
> 
> Thanks,
> Sammy
> 
> 
>
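
For reference, the settings mentioned across this thread can be collected into one session sketch. This only restates what the messages above describe; the values are illustrative, and as reported above the merge flags appeared to be reset to false in job.xml for dynamic partition inserts on trunk at the time (see HIVE-1307), so the merge settings may not take effect in that case.

```sql
-- Sketch only: property names are taken from the thread, values are examples.

-- Merge small output files of map-only jobs.
set hive.merge.mapfiles=true;

-- Merge small output files of jobs that have a reducer.
set hive.merge.mapredfiles=true;

-- The merge job runs only when the average output file size is below this
-- many bytes. The thread gives 16M as the default; 256000000 is the value
-- Viraj tried.
set hive.merge.smallfiles.avgsize=256000000;

-- Read-side workaround suggested for Hadoop 20: combine many small files
-- into fewer input splits when querying the partition.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```

The last setting does not reduce the number of files stored in HDFS, so it would not address the Namenode concern raised above; it only affects how the files are read.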
