inefficient execution plan in this case

Min Zhou Mon, 25 May 2009 19:43:52 -0700

Hi all,

We had a mapreduce job scaning records and generating perhaps more than 10
<key,value> pairs each record.  Here is mappers' pseudo-code
map(key, value, ouput, reporter) {
   if(key.contains(a)) {
    output.collect('class_a', value)
  }
  if(key.contains(b)) {
    output.collect('class_b', value)
  }
  ...
  if(key.contains(z)) {
    output.collect('class_z', value)
  }
}


Reducers did some statisticals on the ouput of mappers.  The whole mapreduce
job paid us only 2 minutes.  But we need more than 10 minutes using union
all query in hive like below:

create table tmp( type string, other_cols' definition)

insert overwrite table tmp
select * from(
  select
    'class_a',
    other_cols
  from
    tbl
  where key.contains(a)

  union all
  select
    'class_b',
    other_cols
  from
    tbl
  where key.contains(b)

  ...

  union all
  select
    'class_z',
    other_cols
  from
    tbl
  where key.contains(z)
) t;

create table result_tbl(type string, count1 int, count2 int, count3 int,
count4 int);

insert overwrite table result_tbl
select
  t1.type,
  t1.count1,
  t2.count2,
  t3.count3,
  t4.count4
from
  (select
    type,
    distinct statistics_1 as count1
  from
    tmp
  group by
    type) t1

  join
  (select
    type,
    distinct statistics_2 as count2
  from
    tmp
  group by
    type) t2

  join
  (select
    type,
    distinct statistics_3 as count3
  from
    tmp
  group by
    type) t3

  join
  (select
    type,
    distinct statistics_4 as count4
  from
    tmp
  group by
    type) t4

  on t1.type=t2.type
    and t1.type=t3.type
    and t1.type=t4.type

First query incuding many unions needed 1 mapreduce jobs, but seemed scan
table N times, where N is the number of unioned select statements.
Second query need 7 mapreduce jobs, obviouslly, more time needed.

Is there anything wrong about us when using Hive? Do you have solution on
that issue? Thanks in advanced!

Yours,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

inefficient execution plan in this case

Reply via email to