On Aug 13, 2010, at 9:52 PM, Leo Alekseyev wrote:

> Ning, thanks -- I can indeed force a map-only task with
> hive.merge.mapfiles=false. However, I'm still curious what triggers
> whether or not the merge MR job is run?.. In my original message I
> gave two sample queries; I believe hive.merge.mapfiles was set to true
> for both of them. But for the first one, the merge MR job ran, while
> for the second, Hive only ran the first map stage and then printed
> something like "Ended Job = 590224440, job is filtered out (removed at
> runtime)".

Whether a merge MR job is triggered is determined at runtime by two conditions: 1) the first job produced more than one output file, and 2) the average size of those files is less than hive.merge.smallfiles.avgsize (default = 16MB). Because these are runtime conditions, the merge job is filtered out if either condition is false.
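[As a quick reference, the parameters Ning mentions can be set from the Hive CLI or a script. A minimal sketch, using the defaults described above (avgsize is in bytes; values shown are illustrative, not tuning advice):]

    -- merge small output files of map-only jobs (default: true)
    set hive.merge.mapfiles=true;
    -- merge small output files of map-reduce jobs (default: false)
    set hive.merge.mapredfiles=false;
    -- average-size threshold below which the merge job is triggered (bytes, ~16MB)
    set hive.merge.smallfiles.avgsize=16000000;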
> Also, would you recommend CombineFileInputFormat with map-only jobs to
> better control the number of chunks on the output?.. Right now I seem
> to have a choice between having 10,000 20MB files or merging into larger
> files but increasing my compute time by 3x in the merge MR job. (As a
> side note, CombineFileInputFormat doesn't work with Cloudera's Hadoop
> 0.20.1 due to some different method signatures in createPool(...), so
> I want to make sure it's worth getting it to work before I start
> making major changes to our deployment.)

Using CombineHiveInputFormat in a map-only job to merge small files is a good idea -- this is exactly what HIVE-1307 will do (see the settings sketch after the thread). I'm not aware of the signature difference in Cloudera's Hadoop distribution. Hive's createPool() signature is compatible with the Hadoop 0.20.2 API, and the upcoming HIVE-1307 patch should stay with the same API, so you may want to ask on the Cloudera forum whether it can be supported.

> --Leo
>
> On Fri, Aug 13, 2010 at 8:45 PM, Ning Zhang <[email protected]> wrote:
>> The second map-reduce job is probably the merge job, which takes the output
>> of the first map-only job (the real query) and merges the resulting files.
>> The merge job is not always triggered. If you look at the plan you may find
>> it is a child of a conditional task, which means it is conditionally
>> triggered based on the results of the first map-only job.
>>
>> You can prevent the merge task from running by setting
>> hive.merge.mapfiles=false. Likewise, hive.merge.mapredfiles controls
>> whether to merge the result of a map-reduce job.
>>
>> On Aug 13, 2010, at 8:16 PM, Leo Alekseyev wrote:
>>
>>> Hi all,
>>> I'm mystified by Hive's behavior for two types of queries.
>>>
>>> 1: Consider the following simple select query:
>>>
>>> insert overwrite table alogs_test_extracted1
>>> select raw.client_ip, raw.cookie, raw.referrer_flag
>>> from alogs_test_rc6 raw;
>>>
>>> Both tables are stored as RCFiles, and LZO compression is turned on.
>>>
>>> Hive runs this in two jobs: a map-only job and a map-reduce job.
>>> Question: can someone explain to me _what_ Hive is doing in the two
>>> map stages?.. I stared at the output of EXPLAIN, but can't figure out
>>> what is going on. When I do similar extractions by hand, I have a
>>> mapper that pulls out fields from records and (optionally) a reducer
>>> that combines the results -- that is, one map stage. Why are there two
>>> here?.. (About 30% of the time is spent on the first map stage, 45% on
>>> the second map stage, and 25% on the reduce step.)
>>>
>>> 2: Consider the "transform...using" query below:
>>>
>>> insert overwrite table alogs_test_rc6
>>> select
>>>   transform (d.ll)
>>>   using 'java myProcessingClass'
>>>   as (field1, field2, field3)
>>> from (select logline as ll from raw_log_test1day) d;
>>>
>>> Here, Hive's plan (as shown via EXPLAIN) also suggests two MR stages: a
>>> map-only stage and a map-reduce stage. However, when the job actually
>>> runs, Hive says "Launching job 1 out of 2", runs the transform script
>>> in mappers, writes the table, and never launches job 2 (the map-reduce
>>> stage in the plan)! Why is this happening, and can I control this
>>> behavior?.. Sometimes it would be preferable for me to run a map-only
>>> job (perhaps combining input data for mappers with
>>> CombineFileInputFormat to avoid generating thousands of 20MB files).
>>>
>>> Thanks in advance to anyone who can clarify Hive's behavior here...
>>> --Leo
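[To make the CombineHiveInputFormat suggestion concrete, here is a sketch of the settings involved. The property names are standard Hive/Hadoop ones, but the split-size values are illustrative assumptions, and this presumes a Hadoop build whose createPool() signature matches, per the Cloudera caveat above:]

    -- feed many small files to a single mapper
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    -- illustrative split-size bounds (bytes): combine up to ~256MB per split
    set mapred.max.split.size=256000000;
    set mapred.min.split.size.per.node=256000000;
    set mapred.min.split.size.per.rack=256000000;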

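[For completeness, Leo's second query rewritten with the conditional merge stage explicitly disabled, as described in Ning's first reply. Table, column, and script names are taken from the thread, so treat this as a sketch rather than a tested query:]

    -- skip the conditional merge MR job; the transform then runs map-only
    set hive.merge.mapfiles=false;

    insert overwrite table alogs_test_rc6
    select
      transform (d.ll)
      using 'java myProcessingClass'
      as (field1, field2, field3)
    from (select logline as ll from raw_log_test1day) d;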