Can you do a explain on the queries ? First query looks OK - 1 map-reduce job - 
The table should be scanned only once and not N times.

The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs.

The way you have written the join query is wrong - it will lead to cartesian 
products of t1 and t2, and then the result of t1 and t2 with t3
and then the join.


Rewrite as:

>From t1 join t2 on t1.type=t2.type
Join t3 on t1.type = t3.type
Join t4 on t1.type = t4.type

It will lead to 5 map-reduce jobs


Predicate push down should fix that - can you file a jira for that, this join 
should be optimized with predicate pushdown.




On 5/25/09 7:43 PM, "Min Zhou" <[email protected]> wrote:

Hi all,

We had a mapreduce job scaning records and generating perhaps more than 10 
<key,value> pairs each record.  Here is mappers' pseudo-code
map(key, value, ouput, reporter) {
   if(key.contains(a)) {
    output.collect('class_a', value)
  }
  if(key.contains(b)) {
    output.collect('class_b', value)
  }
  ...
  if(key.contains(z)) {
    output.collect('class_z', value)
  }
}

Reducers did some statisticals on the ouput of mappers.  The whole mapreduce 
job paid us only 2 minutes.  But we need more than 10 minutes using union all 
query in hive like below:

create table tmp( type string, other_cols' definition)

insert overwrite table tmp
select * from(
  select
    'class_a',
    other_cols
  from
    tbl
  where key.contains(a)

  union all
  select
    'class_b',
    other_cols
  from
    tbl
  where key.contains(b)

  ...

  union all
  select
    'class_z',
    other_cols
  from
    tbl
  where key.contains(z)
) t;

create table result_tbl(type string, count1 int, count2 int, count3 int, count4 
int);

insert overwrite table result_tbl
select
  t1.type,
  t1.count1,
  t2.count2,
  t3.count3,
  t4.count4
from
  (select
    type,
    distinct statistics_1 as count1
  from
    tmp
  group by
    type) t1

  join
  (select
    type,
    distinct statistics_2 as count2
  from
    tmp
  group by
    type) t2

  join
  (select
    type,
    distinct statistics_3 as count3
  from
    tmp
  group by
    type) t3

  join
  (select
    type,
    distinct statistics_4 as count4
  from
    tmp
  group by
    type) t4

  on t1.type=t2.type
    and t1.type=t3.type
    and t1.type=t4.type

First query incuding many unions needed 1 mapreduce jobs, but seemed scan table 
N times, where N is the number of unioned select statements.
Second query need 7 mapreduce jobs, obviouslly, more time needed.

 Is there anything wrong about us when using Hive? Do you have solution on that 
issue? Thanks in advanced!

Yours,
Min

Reply via email to