You don't need a reducer.
explain INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
select col1,col2... from raw_web_data where log_date_part='2009-07-05';

You should see a conditional task, which should automatically merge the small files.
Can you send the output of the explain plan?

Also, can you send:
1. The number of mappers the above query needed.
2. The total size of the output (raw_web_data_seq).

If the average size of the output is < 1G, it will be automatically concatenated.

Thanks,
-namit

-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Monday, July 06, 2009 2:27 PM
To: [email protected]
Subject: Re: Combine data for more throughput

Ashish,

I updated to trunk and tried both approaches.

explain
FROM (
  select log_date,log_time,remote_ip...
  from raw_web_data where log_date_part='2009-07-05'
) a
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
REDUCE a.log_date, a.log_time, a.remote_ip...
USING '/bin/cat' as log_date,log_time,remote_ip,....

The problem with this method seems to be that set mapred.reduce.tasks=X has no
effect on the number of reducers.

2009-07-06 17:00:14,027 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of FAILED map-task: 'attempt_200905271425_1447_m_001387_1'
2009-07-06 17:00:14,027 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200905271425_1447_r_000000_0: Got 41 obsolete map-outputs from tasktracker
2009-07-06 17:00:16,053 WARN org.apache.hadoop.mapred.TaskRunner: Parent died. Exiting attempt_200905271425_1447_r_000000_0

The reduce task never seems to progress, and then it dies. It is a fairly big
dataset, so this could be an OOM issue. I also tried making the second table
not a sequence file and ran into the same problem. Any other hints? Is using
/bin/cat what you meant by an identity reducer? Should I just use Hadoop
directly with an identity mapper and identity reducer for this problem?

Thank you,
Edward

On Mon, Jul 6, 2009 at 2:01 PM, Namit Jain <[email protected]> wrote:
> hive.merge.mapfiles is set to true by default.
> So, in trunk, the small output files should get merged automatically.
> Can you do an explain plan and send it if that is not the case?
>
> -----Original Message-----
> From: Ashish Thusoo [mailto:[email protected]]
> Sent: Monday, July 06, 2009 10:50 AM
> To: [email protected]
> Subject: RE: Combine data for more throughput
>
> Namit recently added a facility to concatenate the files. The problem here is
> that the filter is running in the mapper.
>
> In trunk, if you set
>
> set hive.merge.mapfiles=true
>
> that should do the trick.
>
> In 0.3.0 you can send the output of the select to an identity reducer to get
> the same effect by using the REDUCE syntax.
>
> Ashish
> ________________________________________
> From: Edward Capriolo [[email protected]]
> Sent: Monday, July 06, 2009 9:47 AM
> To: [email protected]
> Subject: Combine data for more throughput
>
> I am currently pulling our 5-minute logs into a Hive table. This
> results in a partition with ~4,000 tiny text-format files, about 4MB
> per file, per day.
>
> I have created a table with an identical set of columns but
> 'STORED AS SEQUENCEFILE'. My goal is to use SequenceFile and merge
> the smaller files into larger ones. This should put less stress on my
> name node and give better performance. I am doing this:
>
> INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
> select col1,col2...
> from raw_web_data where log_date_part='2009-07-05';
>
> This does not do what I need, as I end up with about 4,000 'attempt'
> files like 'attempt_200905271425_1382_m_004318_0'.
> Does anyone have tips on transforming raw data into the
> "fastest/best" possible format? Schema tips would be helpful, but I am
> really looking to merge up the smaller files and choose a fast format: seq,
> LZO, whatever.
>
> Thanks
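
Pulling the trunk-based suggestion together, the merge approach looks roughly like the
following. This is only a sketch: the column list is cut down to the three columns named
in the thread, and the plan output still has to be checked on the actual cluster.

set hive.merge.mapfiles=true;   -- on by default in trunk; merges small files from map-only jobs

-- the plan should end with a conditional task that concatenates the small output files
explain
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
select log_date, log_time, remote_ip   -- remaining columns elided
from raw_web_data
where log_date_part='2009-07-05';

If the conditional task shows up and the average output file size is under the merge
threshold (Namit mentions < 1G), running the same statement without the explain should
produce the merged SequenceFile output.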

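For comparison, the 0.3.0-era identity-reducer route that Ashish describes would look
something like the sketch below, again with a shortened column list and an arbitrary
placeholder reducer count. As Edward reports above, the reducer count setting did not
seem to take effect for him with this syntax, and pushing a large partition through very
few reducers can stall or run out of memory.

set mapred.reduce.tasks=8;   -- intended number of output files; reportedly ignored here

FROM (
  select log_date, log_time, remote_ip   -- remaining columns elided
  from raw_web_data
  where log_date_part='2009-07-05'
) a
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
REDUCE a.log_date, a.log_time, a.remote_ip
USING '/bin/cat'
AS log_date, log_time, remote_ip;

/bin/cat simply copies its input to its output, so the reduce stage exists only to gather
the mapper output into a small number of files.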