Hi all,
We had a mapreduce job scaning records and generating perhaps more than 10
<key,value> pairs each record. Here is mappers' pseudo-code
map(key, value, ouput, reporter) {
if(key.contains(a)) {
output.collect('class_a', value)
}
if(key.contains(b)) {
output.collect('class_b', value)
}
...
if(key.contains(z)) {
output.collect('class_z', value)
}
}
Reducers did some statisticals on the ouput of mappers. The whole mapreduce
job paid us only 2 minutes. But we need more than 10 minutes using union
all query in hive like below:
create table tmp( type string, other_cols' definition)
insert overwrite table tmp
select * from(
select
'class_a',
other_cols
from
tbl
where key.contains(a)
union all
select
'class_b',
other_cols
from
tbl
where key.contains(b)
...
union all
select
'class_z',
other_cols
from
tbl
where key.contains(z)
) t;
create table result_tbl(type string, count1 int, count2 int, count3 int,
count4 int);
insert overwrite table result_tbl
select
t1.type,
t1.count1,
t2.count2,
t3.count3,
t4.count4
from
(select
type,
distinct statistics_1 as count1
from
tmp
group by
type) t1
join
(select
type,
distinct statistics_2 as count2
from
tmp
group by
type) t2
join
(select
type,
distinct statistics_3 as count3
from
tmp
group by
type) t3
join
(select
type,
distinct statistics_4 as count4
from
tmp
group by
type) t4
on t1.type=t2.type
and t1.type=t3.type
and t1.type=t4.type
First query incuding many unions needed 1 mapreduce jobs, but seemed scan
table N times, where N is the number of unioned select statements.
Second query need 7 mapreduce jobs, obviouslly, more time needed.
Is there anything wrong about us when using Hive? Do you have solution on
that issue? Thanks in advanced!
Yours,
Min
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com