You don't need a reducer.
explain INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
select col1,col2... from raw_web_data where log_date_part='2009-07-05';

You should see a conditional task, which should automatically merge the small files.
Can you send the output of the explain plan?

Also, can you send:
1. The number of mappers the above query needed.
2. The total size of the output (raw_web_data_seq).

If the average size of the output is < 1G, it will be automatically concatenated.

Thanks,
-namit

-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Monday, July 06, 2009 2:27 PM
To: [email protected]
Subject: Re: Combine data for more throughput

Ashish,

I updated to trunk and tried both approaches.

explain
FROM (
  select log_date,log_time,remote_ip...
  from raw_web_data where log_date_part='2009-07-05'
) a
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
REDUCE a.log_date, a.log_time, a.remote_ip...
USING '/bin/cat' as log_date,log_time,remote_ip,....

The problem with this method seems to be that set mapred.reduce.tasks=X has no
effect on the number of reducers.

2009-07-06 17:00:14,027 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of FAILED map-task: 'attempt_200905271425_1447_m_001387_1'
2009-07-06 17:00:14,027 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200905271425_1447_r_000000_0: Got 41 obsolete map-outputs from tasktracker
2009-07-06 17:00:16,053 WARN org.apache.hadoop.mapred.TaskRunner: Parent died. Exiting attempt_200905271425_1447_r_000000_0

The reduce task never seems to progress, and then it dies. It is a fairly big
dataset, so this could be an OOM issue. I also tried making the second table
not a sequence file and ran into the same problem. Any other hints? Is using
/bin/cat what you meant by an identity reducer? Should I just use Hadoop
directly with an identity mapper and identity reducer for this problem?

Thank you,
Edward

On Mon, Jul 6, 2009 at 2:01 PM, Namit Jain <[email protected]> wrote:
> hive.merge.mapfiles is set to true by default.
> So, in trunk, the small output files should get merged automatically.
> Can you do an explain plan and send it if that is not the case?
>
> -----Original Message-----
> From: Ashish Thusoo [mailto:[email protected]]
> Sent: Monday, July 06, 2009 10:50 AM
> To: [email protected]
> Subject: RE: Combine data for more throughput
>
> Namit recently added a facility to concatenate the files. The problem here is
> that the filter is running in the mapper.
>
> In trunk, if you set
>
> set hive.merge.mapfiles=true
>
> that should do the trick.
>
> In 0.3.0 you can send the output of the select to an identity reducer to get
> the same effect by using the REDUCE syntax.
>
> Ashish
> ________________________________________
> From: Edward Capriolo [[email protected]]
> Sent: Monday, July 06, 2009 9:47 AM
> To: [email protected]
> Subject: Combine data for more throughput
>
> I am currently pulling our 5-minute logs into a Hive table. This
> results in a partition with ~4,000 tiny text-format files, about 4MB
> per file, per day.
>
> I have created a table with an identical set of columns but
> 'STORED AS SEQUENCEFILE'. My goal is to use SequenceFile and merge
> the smaller files into larger ones. This should put less stress on my
> name node and give better performance. I am doing this:
>
> INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
> select col1,col2...
> from raw_web_data where log_date_part='2009-07-05';
>
> This does not do what I need, as I end up with about 4,000 'attempt'
> files like 'attempt_200905271425_1382_m_004318_0'.
> Does anyone have tips on transforming raw data into the
> "fastest/best" possible format? Schema tips would be helpful, but I am
> really looking to merge up the smaller files and choose a fast format: seq,
> LZO, whatever.
>
> Thanks
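
Pulling the trunk-based suggestion together, the merge approach looks roughly like the
following. This is only a sketch: the column list is cut down to the three columns named
in the thread, and the plan output still has to be checked on the actual cluster.

set hive.merge.mapfiles=true;   -- on by default in trunk; merges small files from map-only jobs

-- the plan should end with a conditional task that concatenates the small output files
explain
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
select log_date, log_time, remote_ip   -- remaining columns elided
from raw_web_data
where log_date_part='2009-07-05';

If the conditional task shows up and the average output file size is under the merge
threshold (Namit mentions < 1G), running the same statement without the explain should
produce the merged SequenceFile output.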

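For comparison, the 0.3.0-era identity-reducer route that Ashish describes would look
something like the sketch below, again with a shortened column list and an arbitrary
placeholder reducer count. As Edward reports above, the reducer count setting did not
seem to take effect for him with this syntax, and pushing a large partition through very
few reducers can stall or run out of memory.

set mapred.reduce.tasks=8;   -- intended number of output files; reportedly ignored here

FROM (
  select log_date, log_time, remote_ip   -- remaining columns elided
  from raw_web_data
  where log_date_part='2009-07-05'
) a
INSERT OVERWRITE TABLE raw_web_data_seq PARTITION (log_date_part='2009-07-05')
REDUCE a.log_date, a.log_time, a.remote_ip
USING '/bin/cat'
AS log_date, log_time, remote_ip;

/bin/cat simply copies its input to its output, so the reduce stage exists only to gather
the mapper output into a small number of files.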