RE: merging the size of the reduce output

Viraj Bhat Thu, 01 Jul 2010 23:41:30 -0700

Okay, I read that handling small files when doing dynamic partitioning
is a work in progress:

https://issues.apache.org/jira/browse/HIVE-1307


There was a suggestion to try:

hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
on Hadoop 0.20 when running queries on this partition.
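
As a minimal sketch of that suggestion, the setting could be applied in
the Hive session before querying the table. The partition value below is
a hypothetical placeholder, and whether this helps with RC files on a
given build is an assumption, not something confirmed in this thread.

```sql
-- Combine small files into fewer input splits (suggested for Hadoop 0.20).
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Hypothetical query against one partition of the table from this thread;
-- the datestamp value is a placeholder.
SELECT COUNT(1) FROM rctable WHERE datestamp = '20100701';
```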

Viraj

 

________________________________

From: Viraj Bhat [mailto:[email protected]] 
Sent: Thursday, July 01, 2010 11:31 PM
To: [email protected]
Cc: [email protected]
Subject: RE: merging the size of the reduce output

 

Hi Yongqiang,

I am facing a similar situation. I am using the latest trunk of Hive
with dynamic partitioning, and the job is map-only: it converts
gzip-compressed text files to RC format.

The query for the task looks similar to:

FROM gztable
INSERT OVERWRITE TABLE rctable
...
PARTITION(datestamp, partitionlevel1, partitionlevel1)
SELECT ...
..

set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=256000000;
set hive.merge.size.smallfiles.avgsize=256000000;

When I run the job, I see that the following are set to false in its
job.xml at startup:

hive.merge.mapfiles = false;
hive.merge.mapredfiles = false;

Is this a bug with dynamic partitioning? Is there something else I need
to set to get this to work and remove the small files I might be
generating?

 

Viraj

 

________________________________

From: Yongqiang He [mailto:[email protected]] 
Sent: Sunday, June 13, 2010 10:56 PM
To: [email protected]
Subject: Re: merging the size of the reduce output

 

I think there is another parameter, "hive.merge.smallfiles.avgsize",
that decides whether to run the merge job based on the average size of
the output files. The default for that parameter is 16M, so if the
average output file size is larger than 16M, the merge will not run.
Maybe you can try increasing that value and see.

Thanks
Yongqiang
On 6/13/10 10:41 PM, "Sammy Yu" <[email protected]> wrote:

Hi,
   I have both hive.merge.mapfiles and hive.merge.mapredfiles set to
true via the shell tool and the hive-default.xml configuration file.
However, it appears that the job configuration is somehow changed before
the job is submitted.  Is there another condition that can cause this to
happen?

Thanks,
Sammy
 

On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu <[email protected]> wrote:

Looking at
ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java,
hive.merge.mapredfiles is effective only if there is a reducer for your
job. Otherwise you should set hive.merge.mapfiles to true.
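
A minimal sketch of the two cases, assuming the settings are issued in
the Hive session before the query (the `--` comment lines assume a Hive
version whose CLI accepts them):

```sql
-- Map-only job (no reducer): merge small files produced by the mappers.
set hive.merge.mapfiles=true;

-- Job with a reduce stage: merge small files produced by the reducers.
set hive.merge.mapredfiles=true;
```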


On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu <[email protected]> wrote:

Hi, 
   I'm running the latest version of trunk, r953172.  I'm doing a
dynamic partition insert overwrite query which generates a lot of small
files in each of the partitions.  I was hoping this could be solved by
setting hive.merge.mapredfiles to true.  However, it seems that whenever
the job is submitted it is always set to false, so it doesn't seem to
have any effect.  I also tried to modify this property in
hive-default.xml, but that didn't work either.

Thanks,
Sammy
