Try the following::
insert overwrite table result_tbl
select
t1.type,
t1.count1,
t2.count2,
t3.count3,
t4.count4
from
(select
type,
distinct statistics_1 as count1
from
tmp
group by
type) t1
join
(select
type,
distinct statistics_2 as count2
from
tmp
group by
type) t2
on t1.type = t2.type
join
(select
type,
distinct statistics_3 as count3
from
tmp
group by
type) t3
on t1.type = t3.type
join
(select
type,
distinct statistics_4 as count4
from
tmp
group by
type) t4
on t1.type=t4.type
It should work and result in 5 map-reduce jobs
From: Ashish Thusoo [mailto:[email protected]]
Sent: Tuesday, May 26, 2009 10:50 AM
To: [email protected]
Subject: RE: inefficient execution plan in this case
Hi Min,
Can you give the exact query you tried? Hive does support queries with
subqueries in the from clause. Namit's query was missing a select clause, so if
you put that in, it should work...
Ashish
________________________________
From: Min Zhou [mailto:[email protected]]
Sent: Monday, May 25, 2009 11:59 PM
To: [email protected]
Subject: Re: inefficient execution plan in this case
Thanks Namit.
I don't think hive can parse this kind of queries you given including
subqueries in the from clause. You will get message from hive syntax parser
like below,
FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF
Even though, temporary tables can used here helping hive generating 5
map-reduce jobs. but plans on joins isnot the hot spot I cared.
First query with many unions lead to a map only job. If is executes like you
explained, scan table only once, why it cost more than 4 minutes while mine
whole statistics using raw mapreduce need only 2 minutes, and mappers took less
than 1 minute?
On Tue, May 26, 2009 at 1:42 PM, Namit Jain
<[email protected]<mailto:[email protected]>> wrote:
Can you do a explain on the queries ? First query looks OK - 1 map-reduce job -
The table should be scanned only once and not N times.
The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs.
The way you have written the join query is wrong - it will lead to cartesian
products of t1 and t2, and then the result of t1 and t2 with t3
and then the join.
Rewrite as:
>From t1 join t2 on t1.type=t2.type
Join t3 on t1.type = t3.type
Join t4 on t1.type = t4.type
It will lead to 5 map-reduce jobs
Predicate push down should fix that - can you file a jira for that, this join
should be optimized with predicate pushdown.
On 5/25/09 7:43 PM, "Min Zhou"
<[email protected]<mailto:[email protected]>> wrote:
Hi all,
We had a mapreduce job scaning records and generating perhaps more than 10
<key,value> pairs each record. Here is mappers' pseudo-code
map(key, value, ouput, reporter) {
if(key.contains(a)) {
output.collect('class_a', value)
}
if(key.contains(b)) {
output.collect('class_b', value)
}
...
if(key.contains(z)) {
output.collect('class_z', value)
}
}
Reducers did some statisticals on the ouput of mappers. The whole mapreduce
job paid us only 2 minutes. But we need more than 10 minutes using union all
query in hive like below:
create table tmp( type string, other_cols' definition)
insert overwrite table tmp
select * from(
select
'class_a',
other_cols
from
tbl
where key.contains(a)
union all
select
'class_b',
other_cols
from
tbl
where key.contains(b)
...
union all
select
'class_z',
other_cols
from
tbl
where key.contains(z)
) t;
create table result_tbl(type string, count1 int, count2 int, count3 int, count4
int);
insert overwrite table result_tbl
select
t1.type,
t1.count1,
t2.count2,
t3.count3,
t4.count4
from
(select
type,
distinct statistics_1 as count1
from
tmp
group by
type) t1
join
(select
type,
distinct statistics_2 as count2
from
tmp
group by
type) t2
join
(select
type,
distinct statistics_3 as count3
from
tmp
group by
type) t3
join
(select
type,
distinct statistics_4 as count4
from
tmp
group by
type) t4
on t1.type=t2.type
and t1.type=t3.type
and t1.type=t4.type
First query incuding many unions needed 1 mapreduce jobs, but seemed scan table
N times, where N is the number of unioned select statements.
Second query need 7 mapreduce jobs, obviouslly, more time needed.
Is there anything wrong about us when using Hive? Do you have solution on that
issue? Thanks in advanced!
Yours,
Min
Regards,
Min
--
My research interests are distributed systems, parallel computing and bytecode
based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com