Re: Why two map stages for a simple select query?

Ning Zhang Wed, 18 Aug 2010 16:57:15 -0700

We have a plan to migrate to the new mapreduce API, but probably not very soon.


On Aug 18, 2010, at 1:13 AM, Leo Alekseyev wrote:

>>> 
>> Using CombineHiveInputFormat in a map-only job to merge small files is a 
>> good idea. Actually this is what HIVE-1307 will do. I'm not aware of the 
>> signature difference in Cloudera's Hadoop distribution. Hive's 
>> createPool() signature is compatible with Hadoop 0.20.2 API and the future 
>> HIVE-1307 patch should also stay with the same API. So you may want to ask 
>> on the Cloudera forum to see if it can be supported.
> 
> Cloudera deprecated
> org.apache.hadoop.mapred.lib.CombineFileInputFormat (see
> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html)
> and made it inherit from the (non-deprecated)
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat, which
> has different method signatures because the new API doesn't use
> JobConf.  Apache Hadoop 0.20.2 does _not_ implement
> CombineFileInputFormat for the new API, but 0.21 does.
> 
> Note that in addition, Hadoop 0.21 explicitly deprecates the old
> org.apache.hadoop.mapred.lib.CombineFileInputFormat, and makes it
> inherit from org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.
> However, unlike Cloudera, Apache provides wrappers to preserve method
> signatures.
> 
> So is there any plan to migrate to the new API, or are you happy with
> using the deprecated (as of 0.21) API, provided that it's backwards
> compatible?..
> 
> 
> 
> 
>> 
>>> --Leo
>>> 
>>> On Fri, Aug 13, 2010 at 8:45 PM, Ning Zhang wrote:
>>>> The second map-reduce job is probably the merge job which takes the output 
>>>> of the first map-only job (the real query) and merges the resulting files. 
>>>> The merge job is not always triggered. If you look at the plan you may 
>>>> find it is a child of a conditional task, which means it is conditionally 
>>>> triggered based on the results of the first map-only job.
>>>> 
>>>> You can disable the merge task by setting 
>>>> hive.merge.mapfiles=false. Likewise hive.merge.mapredfiles is used to 
>>>> control whether to merge the result of a map-reduce job.
>>>> 
>>>> On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:
>>>> 
>>>>> Hi all,
>>>>> I'm mystified by Hive's behavior for two types of queries.
>>>>> 
>>>>> 1: consider the following simple select query:
>>>>> insert overwrite table alogs_test_extracted1
>>>>> select raw.client_ip, raw.cookie, raw.referrer_flag
>>>>> from alogs_test_rc6 raw;
>>>>> Both tables are stored as rcfiles, and LZO compression is turned on.
>>>>> 
>>>>> Hive runs this in two jobs: a map-only, and a map-reduce.  Question:
>>>>> can someone explain to me _what_ hive is doing in the two map jobs?..
>>>>> I stared at the output of EXPLAIN, but can't figure out what is going
>>>>> on.  When I do similar extractions by hand, I have a mapper that pulls
>>>>> out fields from records, and (optionally) a reducer that combines the
>>>>> results -- that is, one map stage.  Why are there two here?..  (about
>>>>> 30% of the time is spent on the first map stage, 45% on the second map
>>>>> stage, and 25% on the reduce step).
>>>>> 
>>>>> 2: consider the "transform..using" query below:
>>>>> insert overwrite table alogs_test_rc6
>>>>> select
>>>>>  transform (d.ll)
>>>>>    using 'java myProcessingClass'
>>>>>    as (field1, field2, field3)
>>>>> from (select logline as ll from raw_log_test1day) d;
>>>>> 
>>>>> Here, the Hive plan (as shown via EXPLAIN) also suggests two MR stages: a
>>>>> map, and a map-reduce.  However, when the job actually runs, Hive says
>>>>> "Launching job 1 out of 2", runs the transform script in mappers,
>>>>> writes the table, and never launches job 2 (the map-reduce stage in
>>>>> the plan)!  Why is this happening, and can I control this behavior?..
>>>>> Sometimes it would be preferable for me to run a map-only job (perhaps
>>>>> combining input data for mappers with CombineFileInputFormat to avoid
>>>>> generating thousands of 20MB files).
>>>>> 
>>>>> Thanks in advance to anyone who can clarify Hive's behavior here...
>>>>> --Leo
>>>> 
>>>> 
>> 
>>
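
As a rough sketch of the settings discussed in this thread, the HiveQL session below turns off the conditional merge stage and then inspects the plan for the first query. The property names hive.merge.mapfiles and hive.merge.mapredfiles and the table names come from the messages above. The hive.input.format line is an assumption about how CombineHiveInputFormat would be switched on, and it may fail on Hadoop distributions where the createPool() signature differs, as described above.

```sql
-- Do not launch the conditional merge job after a map-only query.
set hive.merge.mapfiles=false;
-- Do not launch the conditional merge job after a map-reduce query.
set hive.merge.mapredfiles=false;

-- Assumption: have mappers read combined splits instead of one split per small file.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Inspect the plan to see whether a conditional merge stage is still present.
explain
insert overwrite table alogs_test_extracted1
select raw.client_ip, raw.cookie, raw.referrer_flag
from alogs_test_rc6 raw;
```

With merging disabled, each mapper's output file is left as written, so a query over many small inputs can still produce many small output files unless the inputs are combined first.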
