The second map-reduce job is probably the merge job, which takes the output of the first map-only job (the real query) and merges the resulting files. The merge job is not always triggered. If you look at the plan, you may find that it is a child of a conditional task, which means it is triggered conditionally, based on the results of the first map-only job.
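
To check this in the plan, you can run EXPLAIN on the simple select from the quoted message and look for a stage marked as a conditional operator, with the merge stage listed under it. A minimal sketch; the exact stage numbering and labels vary by Hive version, so treat the comments as a guide to what to look for rather than as literal output:

```sql
-- Print the plan without running the query.
-- In the stage dependencies, look for a conditional stage that follows
-- the map-only stage; the merge job should appear as one of its children.
explain
insert overwrite table alogs_test_extracted1
select raw.client_ip, raw.cookie, raw.referrer_flag
from alogs_test_rc6 raw;
```
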
You can prevent the merge task from running by setting hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles controls whether the output of a map-reduce job is merged.

On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:

> Hi all,
> I'm mystified by Hive's behavior for two types of queries.
>
> 1: consider the following simple select query:
> insert overwrite table alogs_test_extracted1
> select raw.client_ip, raw.cookie, raw.referrer_flag
> from alogs_test_rc6 raw;
> Both tables are stored as rcfiles, and LZO compression is turned on.
>
> Hive runs this in two jobs: a map-only, and a map-reduce. Question:
> can someone explain to me _what_ hive is doing in the two map jobs?..
> I stared at the output of EXPLAIN, but can't figure out what is going
> on. When I do similar extractions by hand, I have a mapper that pulls
> out fields from records, and (optionally) a reducer that combines the
> results -- that is, one map stage. Why are there two here?.. (about
> 30% of the time is spent on the first map stage, 45% on the second map
> stage, and 25% on the reduce step).
>
> 2: consider the "transform..using" query below:
> insert overwrite table alogs_test_rc6
> select
> transform (d.ll)
> using 'java myProcessingClass'
> as (field1, field2, field3)
> from (select logline as ll from raw_log_test1day) d;
>
> Here, Hive plan (as shown via EXPLAIN) also suggests two MR stages: a
> map, and a map-reduce. However, when the job actually runs, Hive says
> "Launching job 1 out of 2", runs the transform script in mappers,
> writes the table, and never launches job 2 (the map-reduce stage in
> the plan)! Why is this happening, and can I control this behavior?..
> Sometimes it would be preferable for me to run a map-only job (perhaps
> combining input data for mappers with CombineFileInputFormat to avoid
> generating thousands of 20MB files).
>
> Thanks in advance to anyone who can clarify Hive's behavior here...
> --Leo
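
As a sketch of the two settings named in the reply, applied per session before the simple insert from the quoted message. The values shown are only illustrative; whether merging is worth the extra job depends on how many small output files the query produces:

```sql
-- Skip the conditional merge stage after map-only jobs,
-- such as the simple select in the quoted message.
set hive.merge.mapfiles=false;
-- Controls merging of the output of map-reduce jobs;
-- set to true to merge, false to skip.
set hive.merge.mapredfiles=false;

insert overwrite table alogs_test_extracted1
select raw.client_ip, raw.cookie, raw.referrer_flag
from alogs_test_rc6 raw;
```

With hive.merge.mapfiles=false the query should run as a single map-only job, at the cost of leaving one output file per mapper.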
