We have a plan to migrate to the new mapreduce API, but probably not very soon.

On Aug 18, 2010, at 1:13 AM, Leo Alekseyev wrote:

>>>
>> Using CombineHiveInputFormat in a map-only job to merge small files is a
>> good idea. Actually this is what HIVE-1307 will do. I'm not aware of the
>> signature difference in Cloudera's Hadoop distribution. The Hive's
>> createPool() signature is compatible with Hadoop 0.20.2 API and the future
>> HIVE-1307 patch should also stay with the same API. So you may want to ask
>> on the Cloudera forum to see if it can be supported.
>
> Cloudera deprecated
> org.apache.hadoop.mapred.lib.CombineFileInputFormat (see
> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html)
> and made it inherit from the (non-deprecated)
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat, which
> has different method signatures because the new API doesn't use
> JobConf. Apache Hadoop 0.20.2 does _not_ implement
> CombineFileInputFormat for the new API, but 0.21 does.
>
> Note that in addition, Hadoop 0.21 explicitly deprecates the old
> org.apache.hadoop.mapred.lib.CombineFileInputFormat, and makes it
> inherit from org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.
> However, unlike Cloudera, Apache provides wrappers to preserve method
> signatures.
>
> So is there any plan to migrate to the new API, or are you happy with
> using the deprecated (as of 0.21) API, provided that it's backwards
> compatible?..
>
>>
>>> --Leo
>>>
>>> On Fri, Aug 13, 2010 at 8:45 PM, Ning Zhang <[email protected]> wrote:
>>>> The second map-reduce job is probably the merge job which takes the output
>>>> of the first map-only job (the real query) and merge the resulting files.
>>>> The merge job is not always triggered. If you look at the plan you may
>>>> find it is a child of a conditional task, which means it is conditionally
>>>> triggered based on the results of the first map-only job.
>>>>
>>>> You can control to not run the merge task by setting
>>>> hive.merge.mapfiles=false. Likewise hive.merge.mapredfiles is used to
>>>> control whether to merge the result of a map-reduce job.
>>>>
>>>> On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:
>>>>
>>>>> Hi all,
>>>>> I'm mystified by Hive's behavior for two types of queries.
>>>>>
>>>>> 1: consider the following simple select query:
>>>>> insert overwrite table alogs_test_extracted1
>>>>> select raw.client_ip, raw.cookie, raw.referrer_flag
>>>>> from alogs_test_rc6 raw;
>>>>> Both tables are stored as rcfiles, and LZO compression is turned on.
>>>>>
>>>>> Hive runs this in two jobs: a map-only, and a map-reduce. Question:
>>>>> can someone explain to me _what_ hive is doing in the two map jobs?..
>>>>> I stared at the output of EXPLAIN, but can't figure out what is going
>>>>> on. When I do similar extractions by hand, I have a mapper that pulls
>>>>> out fields from records, and (optionally) a reducer that combines the
>>>>> results -- that is, one map stage. Why are there two here?.. (about
>>>>> 30% of the time is spent on the first map stage, 45% on the second map
>>>>> stage, and 25% on the reduce step).
>>>>>
>>>>> 2: consider the "transform..using" query below:
>>>>> insert overwrite table alogs_test_rc6
>>>>> select
>>>>> transform (d.ll)
>>>>> using 'java myProcessingClass'
>>>>> as (field1, field2, field3)
>>>>> from (select logline as ll from raw_log_test1day) d;
>>>>>
>>>>> Here, Hive plan (as shown via EXPLAIN) also suggests two MR stages: a
>>>>> map, and a map-reduce. However, when the job actually runs, Hive says
>>>>> "Launching job 1 out of 2", runs the transform script in mappers,
>>>>> writes the table, and never launches job 2 (the map-reduce stage in
>>>>> the plan)! Why is this happening, and can I control this behavior?..
>>>>> Sometimes it would be preferable for me to run a map-only job (perhaps
>>>>> combining input data for mappers with CombineFileInputFormat to avoid
>>>>> generating thousands of 20MB files).
>>>>>
>>>>> Thanks in advance to anyone who can clarify Hive's behavior here...
>>>>> --Leo
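
For anyone finding this thread later, the settings discussed above could be tried as a session-level fragment like the one below. This is only a sketch, not a tested recipe. The two hive.merge.* properties are the ones named in the thread. The hive.input.format property and the fully qualified class name for CombineHiveInputFormat do not appear in the thread and are assumptions to check against your own Hive and Hadoop versions. On the Cloudera build discussed above, the createPool() signature difference may prevent the last setting from working.

```sql
-- Sketch only: session-level settings, issued before the INSERT OVERWRITE query.

-- Skip the conditional merge job after a map-only job (named in the thread).
set hive.merge.mapfiles=false;

-- Skip the conditional merge job after a map-reduce job (named in the thread).
set hive.merge.mapredfiles=false;

-- Assumption: this property name and package path are not given in the thread.
-- The intent is to combine many small input files into fewer map splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```

Running EXPLAIN on the query before and after changing these settings should show whether the merge stage is still planned as a child of the conditional task.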
