Thanks Namit. I don't think Hive can parse the kind of query you gave, with subqueries in the FROM clause. The syntax parser fails with a message like:

FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF

That said, temporary tables could be used here to help Hive generate the 5 map-reduce jobs, but the join plan is not the hot spot I care about.
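For example, something along these lines is what I had in mind. This is only a rough sketch: tmp_1 .. tmp_4 are table names I'm making up for illustration, and I'm assuming the "distinct statistics_1" in my original query really means count(distinct statistics_1):

create table tmp_1(type string, count1 int);
insert overwrite table tmp_1
select type, count(distinct statistics_1)
from tmp
group by type;

-- likewise create and fill tmp_2, tmp_3 and tmp_4 from statistics_2 .. statistics_4

insert overwrite table result_tbl
select t1.type, t1.count1, t2.count2, t3.count3, t4.count4
from tmp_1 t1
join tmp_2 t2 on t1.type = t2.type
join tmp_3 t3 on t1.type = t3.type
join tmp_4 t4 on t1.type = t4.type;

If I understand you correctly, that would be 4 map-reduce jobs for the group-bys plus 1 for the join, i.e. the 5 jobs you mentioned.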
The first query, with its many unions, leads to a map-only job. If it executes the way you explained, scanning the table only once, why does it take more than 4 minutes, while my whole statistics run using raw MapReduce needs only 2 minutes, and its mappers took less than 1 minute?

On Tue, May 26, 2009 at 1:42 PM, Namit Jain <[email protected]> wrote:
> Can you do an explain on the queries? The first query looks OK – 1 map-reduce
> job — the table should be scanned only once and not N times.
>
> The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs.
>
> The way you have written the join query is wrong – it will lead to
> cartesian products of t1 and t2, then of the result of t1 and t2 with t3,
> and then the join.
>
> Rewrite it as:
>
> From t1 join t2 on t1.type = t2.type
> Join t3 on t1.type = t3.type
> Join t4 on t1.type = t4.type
>
> It will lead to 5 map-reduce jobs.
>
> Predicate push down should fix that – can you file a jira for that? This
> join should be optimized with predicate pushdown.
>
> On 5/25/09 7:43 PM, "Min Zhou" <[email protected]> wrote:
>
> Hi all,
>
> We had a mapreduce job scanning records and generating perhaps more than 10
> <key, value> pairs per record. Here is the mappers' pseudo-code:
>
> map(key, value, output, reporter) {
>   if (key.contains(a)) {
>     output.collect('class_a', value)
>   }
>   if (key.contains(b)) {
>     output.collect('class_b', value)
>   }
>   ...
>   if (key.contains(z)) {
>     output.collect('class_z', value)
>   }
> }
>
> Reducers did some statistics on the output of the mappers. The whole
> mapreduce job took us only 2 minutes. But we need more than 10 minutes
> using a union all query in Hive like the one below:
>
> create table tmp(type string, other_cols' definition)
>
> insert overwrite table tmp
> select * from (
>   select
>     'class_a',
>     other_cols
>   from
>     tbl
>   where key.contains(a)
>
>   union all
>
>   select
>     'class_b',
>     other_cols
>   from
>     tbl
>   where key.contains(b)
>
>   ...
>
>   union all
>
>   select
>     'class_z',
>     other_cols
>   from
>     tbl
>   where key.contains(z)
> ) t;
>
> create table result_tbl(type string, count1 int, count2 int, count3 int,
> count4 int);
>
> insert overwrite table result_tbl
> select
>   t1.type,
>   t1.count1,
>   t2.count2,
>   t3.count3,
>   t4.count4
> from
>   (select
>     type,
>     distinct statistics_1 as count1
>   from
>     tmp
>   group by
>     type) t1
>
> join
>   (select
>     type,
>     distinct statistics_2 as count2
>   from
>     tmp
>   group by
>     type) t2
>
> join
>   (select
>     type,
>     distinct statistics_3 as count3
>   from
>     tmp
>   group by
>     type) t3
>
> join
>   (select
>     type,
>     distinct statistics_4 as count4
>   from
>     tmp
>   group by
>     type) t4
>
> on t1.type = t2.type
> and t1.type = t3.type
> and t1.type = t4.type
>
> The first query, including many unions, needed 1 mapreduce job, but seemed to
> scan the table N times, where N is the number of unioned select statements.
> The second query needs 7 mapreduce jobs; obviously, more time is needed.
>
> Is there anything wrong with how we are using Hive? Do you have a solution
> for this issue? Thanks in advance!
>
> Yours,
> Min

Regards,
Min

--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
