Re: Why two map stages for a simple select query?

Ning Zhang Fri, 13 Aug 2010 20:43:55 -0700

The second map-reduce job is probably the merge job, which takes the output of 
the first map-only job (the real query) and merges the resulting files. The 
merge job is not always triggered: if you look at the plan, you will likely find 
that it is a child of a conditional task, so it runs only when the results of 
the first map-only job call for it. That would also explain your second query, 
where job 2 appears in the plan but is never launched.


You can prevent the merge task from running by setting 
hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles controls whether 
the output of a map-reduce job is merged.
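
As a rough sketch of how this might look in a Hive CLI session, using the table 
names from the question below (the EXPLAIN output itself is not reproduced here, 
and whether disabling the merge is desirable depends on how many small files the 
query produces):

```sql
-- Inspect the plan first; the merge job should show up under a conditional task.
EXPLAIN
INSERT OVERWRITE TABLE alogs_test_extracted1
SELECT raw.client_ip, raw.cookie, raw.referrer_flag
FROM alogs_test_rc6 raw;

-- Skip the merge step after map-only jobs (applies to the current session).
SET hive.merge.mapfiles=false;
-- Optionally skip it after map-reduce jobs as well.
SET hive.merge.mapredfiles=false;

-- Re-run the query; with merging disabled, only the map-only job should run.
INSERT OVERWRITE TABLE alogs_test_extracted1
SELECT raw.client_ip, raw.cookie, raw.referrer_flag
FROM alogs_test_rc6 raw;
```

With merging turned off, the output stays as one file per mapper, so the table 
may end up with many small files.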

On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:

> Hi all,
> I'm mystified by Hive's behavior for two types of queries.
> 
> 1: consider the following simple select query:
> insert overwrite table alogs_test_extracted1
> select raw.client_ip, raw.cookie, raw.referrer_flag
> from alogs_test_rc6 raw;
> Both tables are stored as rcfiles, and LZO compression is turned on.
> 
> Hive runs this in two jobs: a map-only, and a map-reduce.  Question:
> can someone explain to me _what_ hive is doing in the two map jobs?..
> I stared at the output of EXPLAIN, but can't figure out what is going
> on.  When I do similar extractions by hand, I have a mapper that pulls
> out fields from records, and (optionally) a reducer that combines the
> results -- that is, one map stage.  Why are there two here?..  (about
> 30% of the time is spent on the first map stage, 45% on the second map
> stage, and 25% on the reduce step).
> 
> 2: consider the "transform..using" query below:
> insert overwrite table alogs_test_rc6
> select
>  transform (d.ll)
>    using 'java myProcessingClass'
>    as (field1, field2, field3)
> from (select logline as ll from raw_log_test1day) d;
> 
> Here, Hive plan (as shown via EXPLAIN) also suggests two MR stages: a
> map, and a map-reduce.  However, when the job actually runs, Hive says
> "Launching job 1 out of 2", runs the transform script in mappers,
> writes the table, and never launches job 2 (the map-reduce stage in
> the plan)!  Why is this happening, and can I control this behavior?..
> Sometimes it would be preferable for me to run a map-only job (perhaps
> combining input data for mappers with CombineFileInputFormat to avoid
> generating thousands of 20MB files).
> 
> Thanks in advance to anyone who can clarify Hive's behavior here...
> --Leo
