Namit recently added a facility to concatenate the small files. The problem in your case is that the filter runs entirely in the mapper, so the query is a map-only job and each mapper writes its own output file.

In trunk, setting hive.merge.mapfiles=true should do the trick. In 0.3.0 you can get the same effect by sending the output of the SELECT through an identity reducer, using the REDUCE syntax.
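Roughly like this on trunk (an untested sketch, reusing the table, partition, and abbreviated column names from your mail):

    -- Merge the many small map output files into larger ones
    -- before they land in the target partition.
    SET hive.merge.mapfiles=true;

    INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
    SELECT col1, col2
    FROM raw_web_data
    WHERE log_date_part = '2009-07-05';

And on 0.3.0, something along these lines. Again just a sketch: '/bin/cat' acts as the identity reducer, CLUSTER BY forces the reduce phase, and mapred.reduce.tasks decides how many output files you end up with. Keep in mind that columns come back from a streaming reducer as strings.

    -- One output file per reducer, so this picks the file count.
    SET mapred.reduce.tasks=10;

    FROM (
      SELECT col1, col2
      FROM raw_web_data
      WHERE log_date_part = '2009-07-05'
      CLUSTER BY col1
    ) staged
    INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
    REDUCE staged.col1, staged.col2 USING '/bin/cat' AS col1, col2;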
Ashish

________________________________________
From: Edward Capriolo [[email protected]]
Sent: Monday, July 06, 2009 9:47 AM
To: [email protected]
Subject: Combine data for more throughput

I am currently pulling our 5-minute logs into a Hive table. This results in a partition of ~4,000 tiny text files, about 4 MB per file, per day. I have created a table with an identical set of columns, 'STORED AS SEQUENCEFILE'. My goal is to use sequence files and merge the smaller files into larger ones, which should put less stress on my NameNode and give better performance. I am doing this:

    INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
    SELECT col1, col2, ... FROM raw_web_data WHERE log_date_part='2009-07-05';

This does not do what I need, as I end up with about 4,000 'attempt' files like 'attempt_200905271425_1382_m_004318_0'.

Does anyone have tips on transforming raw data into the "fastest/best" possible format? Schema tips would be helpful, but I am really looking to merge up the smaller files and choose a fast format: SequenceFile, LZO, whatever.

Thanks
