> Using CombineHiveInputFormat in a map-only job to merge small files is a good
> idea. Actually this is what HIVE-1307 will do. I'm not aware of the signature
> difference in Cloudera's Hadoop distribution. Hive's createPool()
> signature is compatible with the Hadoop 0.20.2 API, and the future HIVE-1307
> patch should also stay with the same API. So you may want to ask on the
> Cloudera forum to see if it can be supported.
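For anyone following along, the session knobs discussed in this thread can be set per query. A minimal sketch; the hive.merge.* properties are named in Ning's reply below, while the hive.input.format property name is from memory, so verify it against your Hive build:

```sql
-- Controls the conditional merge task after a map-only job.
-- Set to false to skip merging its output files:
SET hive.merge.mapfiles=false;

-- Likewise, controls merging the output of a full map-reduce job:
SET hive.merge.mapredfiles=false;

-- Pack many small input files into combined splits, if your build
-- ships CombineHiveInputFormat (property name assumed, not confirmed
-- in this thread):
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```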
Cloudera deprecated org.apache.hadoop.mapred.lib.CombineFileInputFormat (see
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html)
and made it inherit from the (non-deprecated)
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat, which has
different method signatures because the new API doesn't use JobConf. Apache
Hadoop 0.20.2 does _not_ implement CombineFileInputFormat for the new API,
but 0.21 does.

Note that, in addition, Hadoop 0.21 explicitly deprecates the old
org.apache.hadoop.mapred.lib.CombineFileInputFormat and makes it inherit from
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat. However, unlike
Cloudera, Apache provides wrappers to preserve the old method signatures.

So is there any plan to migrate to the new API, or are you happy with using
the deprecated (as of 0.21) API, provided that it's backwards compatible?

>
>> --Leo
>>
>> On Fri, Aug 13, 2010 at 8:45 PM, Ning Zhang <[email protected]> wrote:
>>> The second map-reduce job is probably the merge job, which takes the output
>>> of the first map-only job (the real query) and merges the resulting files.
>>> The merge job is not always triggered. If you look at the plan you may find
>>> it is a child of a conditional task, which means it is conditionally
>>> triggered based on the results of the first map-only job.
>>>
>>> You can keep the merge task from running by setting
>>> hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles is used to
>>> control whether to merge the result of a map-reduce job.
>>>
>>> On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:
>>>
>>>> Hi all,
>>>> I'm mystified by Hive's behavior for two types of queries.
>>>>
>>>> 1: consider the following simple select query:
>>>> insert overwrite table alogs_test_extracted1
>>>> select raw.client_ip, raw.cookie, raw.referrer_flag
>>>> from alogs_test_rc6 raw;
>>>> Both tables are stored as rcfiles, and LZO compression is turned on.
>>>>
>>>> Hive runs this in two jobs: a map-only job and a map-reduce job.
>>>> Question: can someone explain to me _what_ Hive is doing in the two
>>>> map stages? I stared at the output of EXPLAIN, but can't figure out
>>>> what is going on. When I do similar extractions by hand, I have a
>>>> mapper that pulls out fields from records, and (optionally) a reducer
>>>> that combines the results -- that is, one map stage. Why are there
>>>> two here? (About 30% of the time is spent on the first map stage,
>>>> 45% on the second map stage, and 25% on the reduce step.)
>>>>
>>>> 2: consider the "transform..using" query below:
>>>> insert overwrite table alogs_test_rc6
>>>> select
>>>>   transform (d.ll)
>>>>   using 'java myProcessingClass'
>>>>   as (field1, field2, field3)
>>>> from (select logline as ll from raw_log_test1day) d;
>>>>
>>>> Here, the Hive plan (as shown via EXPLAIN) also suggests two MR
>>>> stages: a map, and a map-reduce. However, when the job actually runs,
>>>> Hive says "Launching job 1 out of 2", runs the transform script in
>>>> mappers, writes the table, and never launches job 2 (the map-reduce
>>>> stage in the plan)! Why is this happening, and can I control this
>>>> behavior? Sometimes it would be preferable for me to run a map-only
>>>> job (perhaps combining input data for mappers with
>>>> CombineFileInputFormat to avoid generating thousands of 20MB files).
>>>>
>>>> Thanks in advance to anyone who can clarify Hive's behavior here.
>>>> --Leo
>>>
>>>
>
>
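Since the thread turns on what CombineFileInputFormat buys you, here is a toy, self-contained sketch of the core idea only; this is not Hadoop code, and all names in it are hypothetical. It packs many small files into fewer splits bounded by a maximum split size, so that thousands of 20MB files don't each get their own map task:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of the split-packing idea behind
// CombineFileInputFormat. Real Hadoop code also considers block
// locations, pools, and min-size thresholds; this sketch only shows
// the bin-packing of file sizes into splits.
public class CombinePackingSketch {

    // Greedily pack file sizes (bytes) into splits of at most
    // maxSplitBytes each; each inner list is one combined split.
    static List<List<Long>> packFiles(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (currentBytes + size > maxSplitBytes && !current.isEmpty()) {
                splits.add(current);           // close the current split
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // Ten 20MB files with a 64MB max split: the packing yields
        // 4 combined splits instead of 10 single-file map tasks.
        long[] sizes = new long[10];
        Arrays.fill(sizes, 20L * 1024 * 1024);
        List<List<Long>> splits = packFiles(sizes, 64L * 1024 * 1024);
        System.out.println("combined splits: " + splits.size());
    }
}
```

The point of the thread is that without this combining step, a map-only job that merely projects columns emits one small output file per small input file, which is exactly what the conditional merge job (or HIVE-1307) is meant to clean up.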
