Hi all,
I'm mystified by Hive's behavior for two types of queries.

1: consider the following simple select query:
insert overwrite table alogs_test_extracted1
select raw.client_ip, raw.cookie, raw.referrer_flag
from alogs_test_rc6 raw;
Both tables are stored as RCFiles, and LZO compression is turned on.

Hive runs this as two jobs: a map-only job and a map-reduce job.  Question:
can someone explain to me _what_ Hive is doing in the two map stages?
I stared at the output of EXPLAIN, but can't figure out what is going
on.  When I do similar extractions by hand, I have a mapper that pulls
out fields from records, and (optionally) a reducer that combines the
results -- that is, one map stage.  Why are there two here?  (About
30% of the time is spent on the first map stage, 45% on the second map
stage, and 25% on the reduce step.)
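
For anyone who wants to look at the same plan: it comes from prefixing
the statement with EXPLAIN. A minimal sketch using the table names
above (nothing here is new beyond the EXPLAIN keyword; EXPLAIN EXTENDED
is the standard variant that prints more detail):

```sql
-- Prints the stage plan for the statement without running it.
EXPLAIN
INSERT OVERWRITE TABLE alogs_test_extracted1
SELECT raw.client_ip, raw.cookie, raw.referrer_flag
FROM alogs_test_rc6 raw;
```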

2: consider the "transform..using" query below:
insert overwrite table alogs_test_rc6
select
  transform (d.ll)
    using 'java myProcessingClass'
    as (field1, field2, field3)
from (select logline as ll from raw_log_test1day) d;

Here, the Hive plan (as shown via EXPLAIN) also suggests two MR stages: a
map-only stage and a map-reduce stage.  However, when the job actually
runs, Hive says "Launching job 1 out of 2", runs the transform script in
mappers, writes the table, and never launches job 2 (the map-reduce stage
in the plan)!  Why is this happening, and can I control this behavior?
Sometimes it would be preferable for me to run a map-only job (perhaps
combining input data for mappers with CombineFileInputFormat to avoid
generating thousands of 20MB files).
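
One possibility worth checking, stated as a guess and not a confirmed
answer: the extra stage in both plans may be Hive's conditional
small-file merge step, which would fit with it being skipped at run time
in some cases. The properties below are existing Hive/Hadoop settings,
but the values are only illustrative, and whether they explain the
behavior above is unverified:

```sql
-- Assumption: the second stage is the conditional merge of small output files.
-- Disable merging after map-only jobs to see whether the second job disappears.
SET hive.merge.mapfiles=false;
-- Same switch for the output of map-reduce jobs.
SET hive.merge.mapredfiles=false;

-- Combine small input files per mapper; the 256 MB split cap is illustrative.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=268435456;
```

Comparing the EXPLAIN output and the job count with and without these
settings should show whether the merge step is what is being planned.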

Thanks in advance to anyone who can clarify Hive's behavior here...
--Leo
