RE: merging the size of the reduce output

John Sichi Thu, 01 Jul 2010 23:59:51 -0700

Ning is currently out on vacation; I think he'll be back to working on this 
when he returns.

JVS

________________________________________
From: Viraj Bhat [[email protected]]
Sent: Thursday, July 01, 2010 11:40 PM
To: [email protected]
Subject: RE: merging the size of the reduce output

Okay, I read that there is work in progress 
(https://issues.apache.org/jira/browse/HIVE-1307) to deal with small files when 
doing dynamic partitioning.
There was a suggestion to try setting
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat on 
Hadoop 20 when running queries on this partition.
Viraj

________________________________
From: Viraj Bhat [mailto:[email protected]]
Sent: Thursday, July 01, 2010 11:31 PM
To: [email protected]
Cc: [email protected]
Subject: RE: merging the size of the reduce output

Hi Yongqiang,
 I am facing a similar situation with the latest trunk of Hive. I am 
using Hive's dynamic partitioning in a map-only job that converts 
files from gzip-compressed text to RC format.
The query for the task looks similar to:

FROM gztable
INSERT OVERWRITE TABLE rctable
…
PARTITION(datestamp, partitionlevel1, partitionlevel1)
SELECT …
..
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=256000000;
set hive.merge.size.smallfiles.avgsize=256000000;

When I run a job, I see that the following are set to false in the job.xml when 
the job starts up.
hive.merge.mapfiles = false;
hive.merge.mapredfiles = false;

Is this a bug with dynamic partitioning?  Is there something else I need to set 
to get this to work and remove the small files I might be generating?

Viraj

________________________________
From: Yongqiang He [mailto:[email protected]]
Sent: Sunday, June 13, 2010 10:56 PM
To: [email protected]
Subject: Re: merging the size of the reduce output

I think there is another parameter, “hive.merge.smallfiles.avgsize”, that 
decides whether to run the merge job based on the average size of the output 
files. The default for that parameter is 16M, so if the average output size is 
larger than 16M, the merge will not run.
Maybe you can try increasing that value and see.

Thanks
Yongqiang
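
Note: the average-size check described above can be sketched as follows. This is only an illustration under stated assumptions, not Hive's actual code: the function name needs_merge and its parameters are hypothetical, and the default of 16,000,000 bytes is an assumed reading of the "16M" mentioned above.

```python
def needs_merge(file_sizes, avgsize_threshold=16_000_000):
    """Return True when the average output file size is below the threshold.

    file_sizes: sizes in bytes of the files a job wrote (hypothetical input).
    avgsize_threshold: stands in for hive.merge.smallfiles.avgsize; the
    default here is an assumed value for the "16M" quoted in the thread.
    """
    if not file_sizes:
        # Nothing was written, so there is nothing to merge.
        return False
    average = sum(file_sizes) / len(file_sizes)
    return average < avgsize_threshold


if __name__ == "__main__":
    # Two files averaging 25 MB: no merge at the assumed default threshold,
    # but a merge once the threshold is raised to 256000000.
    sizes = [20_000_000, 30_000_000]
    print(needs_merge(sizes))
    print(needs_merge(sizes, avgsize_threshold=256_000_000))
```

Under this reading, raising the threshold makes more jobs qualify for the merge step, which is why increasing the value is suggested.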
On 6/13/10 10:41 PM, "Sammy Yu" <[email protected]> wrote:
Hi,
   I have both hive.merge.mapfiles and hive.merge.mapredfiles set to true 
via the shell tool and the hive-default.xml configuration file.  However, it 
appears that the job configuration is somehow changed before the job is 
submitted.  Is there another condition that can cause this to happen?

Thanks,
Sammy

On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu <[email protected]> wrote:
Looking at ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java, 
hive.merge.mapredfiles takes effect only if your job has a reducer.
Otherwise, you should set hive.merge.mapfiles to true.
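
Note: the rule described above (one flag for jobs with a reducer, the other for map-only jobs) can be sketched as follows. This is a rough illustration, not the logic in GenMRFileSink1.java: the function names and the dict-based configuration are hypothetical, and treating a missing key as "false" is an assumption of the sketch, not a statement about Hive's defaults.

```python
def applicable_merge_flag(has_reducer):
    """Pick the merge property that governs a job, per the rule above."""
    return "hive.merge.mapredfiles" if has_reducer else "hive.merge.mapfiles"


def merge_enabled(conf, has_reducer):
    """Check whether the governing merge property is set to true.

    conf: plain dict of property name -> string value (hypothetical stand-in
    for a job configuration). A missing key counts as "false" in this sketch.
    """
    flag = applicable_merge_flag(has_reducer)
    return conf.get(flag, "false").strip().lower() == "true"


if __name__ == "__main__":
    # Only the map-reduce flag is set, so a map-only job would not merge.
    conf = {"hive.merge.mapredfiles": "true"}
    print(applicable_merge_flag(has_reducer=False))
    print(merge_enabled(conf, has_reducer=False))
    print(merge_enabled(conf, has_reducer=True))
```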

On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu <[email protected]> wrote:
Hi,
   I'm running the latest version of trunk, r953172.  I'm doing a dynamic 
partition insert overwrite query which generates a lot of small files in each 
of the partitions.  I was hoping this could be solved by setting 
hive.merge.mapredfiles to true.  However, it seems that whenever the job is 
submitted it is always set to false, so it doesn't seem to have any effect.  I 
also tried to modify this property in hive-default.xml, but it didn't 
work either.

Thanks,
Sammy
