Hi all,
I'm mystified by Hive's behavior for two types of queries.
1: Consider the following simple select query:
    insert overwrite table alogs_test_extracted1
    select raw.client_ip, raw.cookie, raw.referrer_flag
    from alogs_test_rc6 raw;
Both tables are stored as rcfiles, and LZO compression is turned on.
Hive runs this as two jobs: a map-only job, followed by a map-reduce
job. Can someone explain _what_ Hive is doing in the two map stages?
I stared at the output of EXPLAIN but can't figure out what is going
on. When I do similar extractions by hand, I have a mapper that pulls
out fields from records and (optionally) a reducer that combines the
results; that is, one map stage. Why are there two here? (About 30%
of the time is spent in the first map stage, 45% in the second map
stage, and 25% in the reduce step.)
2: Consider the "transform..using" query below:
    insert overwrite table alogs_test_rc6
    select
      transform (d.ll)
      using 'java myProcessingClass'
      as (field1, field2, field3)
    from (select logline as ll from raw_log_test1day) d;
Here, the Hive plan (as shown by EXPLAIN) also has two MR stages: a
map-only stage and a map-reduce stage. However, when the query
actually runs, Hive says "Launching job 1 out of 2", runs the
transform script in the mappers, writes the table, and never launches
job 2 (the map-reduce stage in the plan)! Why is this happening, and
can I control this behavior? Sometimes I would prefer to run a
map-only job (perhaps combining input data for the mappers with
CombineFileInputFormat to avoid generating thousands of 20MB files).
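To make that last point concrete, here is an untested sketch of the
kind of session settings I imagine might be involved. As far as I know
both property names are real Hive settings, but whether they exist in
my Hive version, and whether they are what triggers or suppresses the
second job, is purely an assumption on my part:

```sql
-- Assumption: the extra stage is a conditional small-file merge step.
-- Turning it off might force a map-only run (unverified).
set hive.merge.mapfiles=false;

-- Assumption: this input format packs several small input files into
-- one split per mapper, similar in spirit to CombineFileInputFormat.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```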
Thanks in advance to anyone who can clarify Hive's behavior here...
--Leo