Can you do a explain on the queries ? First query looks OK - 1 map-reduce job - The table should be scanned only once and not N times.
The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs. The way you have written the join query is wrong - it will lead to cartesian products of t1 and t2, and then the result of t1 and t2 with t3 and then the join. Rewrite as: >From t1 join t2 on t1.type=t2.type Join t3 on t1.type = t3.type Join t4 on t1.type = t4.type It will lead to 5 map-reduce jobs Predicate push down should fix that - can you file a jira for that, this join should be optimized with predicate pushdown. On 5/25/09 7:43 PM, "Min Zhou" <[email protected]> wrote: Hi all, We had a mapreduce job scaning records and generating perhaps more than 10 <key,value> pairs each record. Here is mappers' pseudo-code map(key, value, ouput, reporter) { if(key.contains(a)) { output.collect('class_a', value) } if(key.contains(b)) { output.collect('class_b', value) } ... if(key.contains(z)) { output.collect('class_z', value) } } Reducers did some statisticals on the ouput of mappers. The whole mapreduce job paid us only 2 minutes. But we need more than 10 minutes using union all query in hive like below: create table tmp( type string, other_cols' definition) insert overwrite table tmp select * from( select 'class_a', other_cols from tbl where key.contains(a) union all select 'class_b', other_cols from tbl where key.contains(b) ... union all select 'class_z', other_cols from tbl where key.contains(z) ) t; create table result_tbl(type string, count1 int, count2 int, count3 int, count4 int); insert overwrite table result_tbl select t1.type, t1.count1, t2.count2, t3.count3, t4.count4 from (select type, distinct statistics_1 as count1 from tmp group by type) t1 join (select type, distinct statistics_2 as count2 from tmp group by type) t2 join (select type, distinct statistics_3 as count3 from tmp group by type) t3 join (select type, distinct statistics_4 as count4 from tmp group by type) t4 on t1.type=t2.type and t1.type=t3.type and t1.type=t4.type First query incuding many unions needed 1 mapreduce jobs, but seemed scan table N times, where N is the number of unioned select statements. Second query need 7 mapreduce jobs, obviouslly, more time needed. Is there anything wrong about us when using Hive? Do you have solution on that issue? Thanks in advanced! Yours, Min
