On Aug 13, 2010, at 9:52 PM, Leo Alekseyev wrote:

> Ning, thanks -- I can indeed force a map-only task with
> hive.merge.mapfiles=false. However, I'm still curious what triggers
> whether or not the merge MR job is run?.. In my original message I
> gave two sample queries; I believe hive.merge.mapfiles was set to true
> for both of them. But for the first one, the merge MR job ran, while
> for the second, Hive only ran the first map stage and then printed
> something like "Ended Job = 590224440, job is filtered out (removed at
> runtime)".

Whether a merge MR job is triggered is determined at runtime by two conditions: 1) the first job produced more than one output file, and 2) the average size of those files is less than hive.merge.smallfiles.avgsize (default = 16MB). Because these are runtime conditions, the merge job is filtered out if either condition is false.
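[As a quick reference, the parameters Ning mentions can be set from the Hive CLI or a script. A minimal sketch, using the defaults described above (avgsize is in bytes; values shown are illustrative, not tuning advice):]

    -- merge small output files of map-only jobs (default: true)
    set hive.merge.mapfiles=true;
    -- merge small output files of map-reduce jobs (default: false)
    set hive.merge.mapredfiles=false;
    -- average-size threshold below which the merge job is triggered (bytes, ~16MB)
    set hive.merge.smallfiles.avgsize=16000000;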
> Also, would you recommend CombineFileInputFormat with map-only jobs to
> better control the number of chunks on the output?.. Right now I seem
> to have a choice between having 10,000 20MB files or merging into larger
> files but increasing my compute time by 3x in the merge MR job. (As a
> side note, CombineFileInputFormat doesn't work with Cloudera's Hadoop
> 0.20.1 due to some different method signatures in createPool(...), so
> I want to make sure it's worth getting it to work before I start
> making major changes to our deployment.)

Using CombineHiveInputFormat in a map-only job to merge small files is a good idea -- this is exactly what HIVE-1307 will do (see the settings sketch after the thread). I'm not aware of the signature difference in Cloudera's Hadoop distribution. Hive's createPool() signature is compatible with the Hadoop 0.20.2 API, and the upcoming HIVE-1307 patch should stay with the same API, so you may want to ask on the Cloudera forum whether it can be supported.

> --Leo
>
> On Fri, Aug 13, 2010 at 8:45 PM, Ning Zhang <[email protected]> wrote:
>> The second map-reduce job is probably the merge job, which takes the output
>> of the first map-only job (the real query) and merges the resulting files.
>> The merge job is not always triggered. If you look at the plan you may find
>> it is a child of a conditional task, which means it is conditionally
>> triggered based on the results of the first map-only job.
>>
>> You can prevent the merge task from running by setting
>> hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles controls
>> whether to merge the result of a map-reduce job.
>>
>> On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:
>>
>>> Hi all,
>>> I'm mystified by Hive's behavior for two types of queries.
>>>
>>> 1: Consider the following simple select query:
>>>
>>> insert overwrite table alogs_test_extracted1
>>> select raw.client_ip, raw.cookie, raw.referrer_flag
>>> from alogs_test_rc6 raw;
>>>
>>> Both tables are stored as RCFiles, and LZO compression is turned on.
>>>
>>> Hive runs this in two jobs: a map-only job and a map-reduce job.
>>> Question: can someone explain to me _what_ Hive is doing in the two
>>> map stages?.. I stared at the output of EXPLAIN, but can't figure out
>>> what is going on. When I do similar extractions by hand, I have a
>>> mapper that pulls out fields from records and (optionally) a reducer
>>> that combines the results -- that is, one map stage. Why are there two
>>> here?.. (About 30% of the time is spent on the first map stage, 45% on
>>> the second map stage, and 25% on the reduce step.)
>>>
>>> 2: Consider the "transform...using" query below:
>>>
>>> insert overwrite table alogs_test_rc6
>>> select
>>>   transform (d.ll)
>>>   using 'java myProcessingClass'
>>>   as (field1, field2, field3)
>>> from (select logline as ll from raw_log_test1day) d;
>>>
>>> Here, Hive's plan (as shown via EXPLAIN) also suggests two MR stages: a
>>> map-only stage and a map-reduce stage. However, when the job actually
>>> runs, Hive says "Launching job 1 out of 2", runs the transform script
>>> in mappers, writes the table, and never launches job 2 (the map-reduce
>>> stage in the plan)! Why is this happening, and can I control this
>>> behavior?.. Sometimes it would be preferable for me to run a map-only
>>> job (perhaps combining input data for mappers with
>>> CombineFileInputFormat to avoid generating thousands of 20MB files).
>>>
>>> Thanks in advance to anyone who can clarify Hive's behavior here...
>>> --Leo
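[To make the CombineHiveInputFormat suggestion concrete, here is a sketch of the settings involved. The property names are standard Hive/Hadoop ones, but the split-size values are illustrative assumptions, and this presumes a Hadoop build whose createPool() signature matches, per the Cloudera caveat above:]

    -- feed many small files to a single mapper
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- illustrative split-size bounds (bytes): combine up to ~256MB per split
    set mapred.max.split.size=256000000;
    set mapred.min.split.size.per.node=256000000;
    set mapred.min.split.size.per.rack=256000000;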

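[For completeness, Leo's second query rewritten with the conditional merge stage explicitly disabled, as described in Ning's first reply. Table, column, and script names are taken from the thread, so treat this as a sketch rather than a tested query:]

    -- skip the conditional merge MR job; the transform then runs map-only
    set hive.merge.mapfiles=false;

    insert overwrite table alogs_test_rc6
    select
      transform (d.ll)
      using 'java myProcessingClass'
      as (field1, field2, field3)
    from (select logline as ll from raw_log_test1day) d;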