hive.merge.mapfiles is set to true by default.
So, in trunk, you should get small output files.
Can you do an explain plan and send it if that is not the case?
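
For reference, a minimal sketch of what I mean (using the table and partition from the query below; column list abbreviated):

set hive.merge.mapfiles=true;

EXPLAIN
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
SELECT col1, col2
FROM raw_web_data
WHERE log_date_part='2009-07-05';

If merging kicks in, the plan should show an extra conditional merge stage after the main map stage (the exact shape varies between versions).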

-----Original Message-----
From: Ashish Thusoo [mailto:[email protected]] 
Sent: Monday, July 06, 2009 10:50 AM
To: [email protected]
Subject: RE: Combine data for more throughput

Namit recently added a facility to concatenate the files. The problem here is 
that the filter is running in the mapper.

In trunk, if you set

set hive.merge.mapfiles=true

that should do the trick.

In 0.3.0 you can get the same effect by sending the output of the select
to an identity reducer using the REDUCE syntax.
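
Roughly something like this (a sketch only: the column names are placeholders from Edward's query, and the CLUSTER BY column and reducer count are assumptions you would tune):

set mapred.reduce.tasks=10;

FROM (
  FROM raw_web_data
  SELECT col1, col2
  WHERE log_date_part = '2009-07-05'
  CLUSTER BY col1
) m
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
REDUCE m.col1, m.col2
  USING '/bin/cat'
  AS col1, col2;

Each reducer writes one output file, so mapred.reduce.tasks controls how many files end up in the partition.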

Ashish
________________________________________
From: Edward Capriolo [[email protected]]
Sent: Monday, July 06, 2009 9:47 AM
To: [email protected]
Subject: Combine data for more throughput

I am currently pulling our 5-minute logs into a Hive table. This
results in a partition with ~4,000 tiny text files, about 4 MB per
file, per day.

I have created a table with an identical set of columns, but with
'STORED AS SEQUENCEFILE'. My goal is to use sequence files and merge
the smaller files into larger ones. This should put less stress on my
name node and give better performance. I am doing this:

INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
select col1,col2...
from raw_web_data  where log_date_part='2009-07-05';

This does not do what I need, as I end up with about 4,000 'attempt'
files like 'attempt_200905271425_1382_m_004318_0'.
Does anyone have some tips on transforming raw data into the
"fastest/best" possible format? Schema tips would be helpful, but I am
really looking to merge up the smaller files and choose a fast format:
SequenceFile, LZO, whatever.
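
For concreteness, a SequenceFile + LZO combination is the sort of thing I have in mind, with settings along these lines (the property names are the Hadoop/Hive ones I believe apply here, and the LZO codec is an assumption since it needs the hadoop-lzo libraries installed on the cluster):

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;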

Thanks
