Re: inefficient execution plan in this case

Namit Jain Mon, 25 May 2009 22:43:30 -0700

Can you do a explain on the queries ? First query looks OK - 1 map-reduce job - 
The table should be scanned only once and not N times.


The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs.

The way you have written the join query is wrong - it will lead to cartesian 
products of t1 and t2, and then the result of t1 and t2 with t3
and then the join.


Rewrite as:

>From t1 join t2 on t1.type=t2.type
Join t3 on t1.type = t3.type
Join t4 on t1.type = t4.type

It will lead to 5 map-reduce jobs


Predicate push down should fix that - can you file a jira for that, this join 
should be optimized with predicate pushdown.




On 5/25/09 7:43 PM, "Min Zhou" <[email protected]> wrote:

Hi all,

We had a mapreduce job scaning records and generating perhaps more than 10 
<key,value> pairs each record.  Here is mappers' pseudo-code
map(key, value, ouput, reporter) {
   if(key.contains(a)) {
    output.collect('class_a', value)
  }
  if(key.contains(b)) {
    output.collect('class_b', value)
  }
  ...
  if(key.contains(z)) {
    output.collect('class_z', value)
  }
}

Reducers did some statisticals on the ouput of mappers.  The whole mapreduce 
job paid us only 2 minutes.  But we need more than 10 minutes using union all 
query in hive like below:

create table tmp( type string, other_cols' definition)

insert overwrite table tmp
select * from(
  select
    'class_a',
    other_cols
  from
    tbl
  where key.contains(a)

  union all
  select
    'class_b',
    other_cols
  from
    tbl
  where key.contains(b)

  ...

  union all
  select
    'class_z',
    other_cols
  from
    tbl
  where key.contains(z)
) t;

create table result_tbl(type string, count1 int, count2 int, count3 int, count4 
int);

insert overwrite table result_tbl
select
  t1.type,
  t1.count1,
  t2.count2,
  t3.count3,
  t4.count4
from
  (select
    type,
    distinct statistics_1 as count1
  from
    tmp
  group by
    type) t1

  join
  (select
    type,
    distinct statistics_2 as count2
  from
    tmp
  group by
    type) t2

  join
  (select
    type,
    distinct statistics_3 as count3
  from
    tmp
  group by
    type) t3

  join
  (select
    type,
    distinct statistics_4 as count4
  from
    tmp
  group by
    type) t4

  on t1.type=t2.type
    and t1.type=t3.type
    and t1.type=t4.type

First query incuding many unions needed 1 mapreduce jobs, but seemed scan table 
N times, where N is the number of unioned select statements.
Second query need 7 mapreduce jobs, obviouslly, more time needed.

 Is there anything wrong about us when using Hive? Do you have solution on that 
issue? Thanks in advanced!

Yours,
Min

Re: inefficient execution plan in this case

Reply via email to