RE: inefficient execution plan in this case

Ashish Thusoo Tue, 26 May 2009 10:50:50 -0700

Hi Min,

Can you give the exact query you tried? Hive does support queries with 
subqueries in the from clause. Namit's query was missing a select clause, so if 
you put that in, it should work...

Ashish

________________________________
From: Min Zhou [mailto:[email protected]]
Sent: Monday, May 25, 2009 11:59 PM
To: [email protected]
Subject: Re: inefficient execution plan in this case

Thanks Namit.
I don't think hive can parse this kind of queries you given including 
subqueries in the from clause. You will get message from hive syntax parser 
like below,
FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF
Even though,  temporary tables can used here helping hive generating 5 
map-reduce jobs. but plans on joins isnot the hot spot I cared.

First query with many unions lead to a map only job. If is executes like you 
explained,  scan table only once, why it cost more than 4 minutes while mine 
whole statistics using raw mapreduce need only 2 minutes, and mappers took less 
than 1 minute?

On Tue, May 26, 2009 at 1:42 PM, Namit Jain 
<[email protected]<mailto:[email protected]>> wrote:
Can you do a explain on the queries ? First query looks OK - 1 map-reduce job - 
The table should be scanned only once and not N times.

The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs.

The way you have written the join query is wrong - it will lead to cartesian 
products of t1 and t2, and then the result of t1 and t2 with t3
and then the join.

Rewrite as:

>From t1 join t2 on t1.type=t2.type
Join t3 on t1.type = t3.type
Join t4 on t1.type = t4.type

It will lead to 5 map-reduce jobs

Predicate push down should fix that - can you file a jira for that, this join 
should be optimized with predicate pushdown.

On 5/25/09 7:43 PM, "Min Zhou" 
<[email protected]<mailto:[email protected]>> wrote:

Hi all,

We had a mapreduce job scaning records and generating perhaps more than 10 
<key,value> pairs each record.  Here is mappers' pseudo-code
map(key, value, ouput, reporter) {
   if(key.contains(a)) {
    output.collect('class_a', value)
  }
  if(key.contains(b)) {
    output.collect('class_b', value)
  }
  ...
  if(key.contains(z)) {
    output.collect('class_z', value)
  }
}

Reducers did some statisticals on the ouput of mappers.  The whole mapreduce 
job paid us only 2 minutes.  But we need more than 10 minutes using union all 
query in hive like below:

create table tmp( type string, other_cols' definition)

insert overwrite table tmp
select * from(
  select
    'class_a',
    other_cols
  from
    tbl
  where key.contains(a)

  union all
  select
    'class_b',
    other_cols
  from
    tbl
  where key.contains(b)

  ...

  union all
  select
    'class_z',
    other_cols
  from
    tbl
  where key.contains(z)
) t;

create table result_tbl(type string, count1 int, count2 int, count3 int, count4 
int);

insert overwrite table result_tbl
select
  t1.type,
  t1.count1,
  t2.count2,
  t3.count3,
  t4.count4
from
  (select
    type,
    distinct statistics_1 as count1
  from
    tmp
  group by
    type) t1

  join
  (select
    type,
    distinct statistics_2 as count2
  from
    tmp
  group by
    type) t2

  join
  (select
    type,
    distinct statistics_3 as count3
  from
    tmp
  group by
    type) t3

  join
  (select
    type,
    distinct statistics_4 as count4
  from
    tmp
  group by
    type) t4

  on t1.type=t2.type
    and t1.type=t3.type
    and t1.type=t4.type

First query incuding many unions needed 1 mapreduce jobs, but seemed scan table 
N times, where N is the number of unioned select statements.
Second query need 7 mapreduce jobs, obviouslly, more time needed.

 Is there anything wrong about us when using Hive? Do you have solution on that 
issue? Thanks in advanced!

Yours,
Min

Regards,
Min
--
My research interests are distributed systems, parallel computing and bytecode 
based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

RE: inefficient execution plan in this case

Reply via email to