I am currently pulling our 5-minute logs into a Hive table. This
results in a daily partition of ~4,000 tiny text files, about 4 MB
per file.

I have created a table with an identical set of columns, declared
with 'STORED AS SEQUENCEFILE'. My goal is to use SequenceFiles and
merge the small files into larger ones, which should put less stress
on my namenode and give better performance. I am doing this:

INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
SELECT col1, col2, ...
FROM raw_web_data WHERE log_date_part='2009-07-05';

This does not do what I need, as I end up with about 4,000 'attempt'
files like 'attempt_200905271425_1382_m_004318_0' (apparently one per
map task).
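
One thing I have been meaning to try (can anyone confirm it works?)
is forcing a reduce stage, so the number of output files is bounded
by the reducer count instead of the mapper count. A rough sketch; the
reducer count and the DISTRIBUTE BY column here are just guesses on
my part:

-- force a reduce phase; output files = number of reducers
SET mapred.reduce.tasks=16;
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
SELECT col1, col2, ...
FROM raw_web_data WHERE log_date_part='2009-07-05'
DISTRIBUTE BY col1;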
Does anyone have tips on transforming raw data into the
"fastest/best" possible format? Schema tips would be welcome, but I
am mainly looking to merge the small files and choose a fast format:
SequenceFile, LZO, whatever.
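
On the format question, what I had in mind for the SequenceFile route
is block-compressed output, roughly like this (DefaultCodec shown as
a stand-in; the LZO codec class name depends on which LZO package you
have installed):

-- compress SequenceFile output, block compression
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;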

Thanks
