Hi Min, Can you give the exact query you tried? Hive does support queries with subqueries in the from clause. Namit's query was missing a select clause, so if you put that in, it should work...
Ashish ________________________________ From: Min Zhou [mailto:[email protected]] Sent: Monday, May 25, 2009 11:59 PM To: [email protected] Subject: Re: inefficient execution plan in this case Thanks Namit. I don't think hive can parse this kind of queries you given including subqueries in the from clause. You will get message from hive syntax parser like below, FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF Even though, temporary tables can used here helping hive generating 5 map-reduce jobs. but plans on joins isnot the hot spot I cared. First query with many unions lead to a map only job. If is executes like you explained, scan table only once, why it cost more than 4 minutes while mine whole statistics using raw mapreduce need only 2 minutes, and mappers took less than 1 minute? On Tue, May 26, 2009 at 1:42 PM, Namit Jain <[email protected]<mailto:[email protected]>> wrote: Can you do a explain on the queries ? First query looks OK - 1 map-reduce job - The table should be scanned only once and not N times. The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs. The way you have written the join query is wrong - it will lead to cartesian products of t1 and t2, and then the result of t1 and t2 with t3 and then the join. Rewrite as: >From t1 join t2 on t1.type=t2.type Join t3 on t1.type = t3.type Join t4 on t1.type = t4.type It will lead to 5 map-reduce jobs Predicate push down should fix that - can you file a jira for that, this join should be optimized with predicate pushdown. On 5/25/09 7:43 PM, "Min Zhou" <[email protected]<mailto:[email protected]>> wrote: Hi all, We had a mapreduce job scaning records and generating perhaps more than 10 <key,value> pairs each record. Here is mappers' pseudo-code map(key, value, ouput, reporter) { if(key.contains(a)) { output.collect('class_a', value) } if(key.contains(b)) { output.collect('class_b', value) } ... if(key.contains(z)) { output.collect('class_z', value) } } Reducers did some statisticals on the ouput of mappers. The whole mapreduce job paid us only 2 minutes. But we need more than 10 minutes using union all query in hive like below: create table tmp( type string, other_cols' definition) insert overwrite table tmp select * from( select 'class_a', other_cols from tbl where key.contains(a) union all select 'class_b', other_cols from tbl where key.contains(b) ... union all select 'class_z', other_cols from tbl where key.contains(z) ) t; create table result_tbl(type string, count1 int, count2 int, count3 int, count4 int); insert overwrite table result_tbl select t1.type, t1.count1, t2.count2, t3.count3, t4.count4 from (select type, distinct statistics_1 as count1 from tmp group by type) t1 join (select type, distinct statistics_2 as count2 from tmp group by type) t2 join (select type, distinct statistics_3 as count3 from tmp group by type) t3 join (select type, distinct statistics_4 as count4 from tmp group by type) t4 on t1.type=t2.type and t1.type=t3.type and t1.type=t4.type First query incuding many unions needed 1 mapreduce jobs, but seemed scan table N times, where N is the number of unioned select statements. Second query need 7 mapreduce jobs, obviouslly, more time needed. Is there anything wrong about us when using Hive? Do you have solution on that issue? Thanks in advanced! Yours, Min Regards, Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com
