Thanks Namit. I don't think Hive can parse the kind of query you gave, with subqueries in the FROM clause. The syntax parser fails with a message like:

FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF

That said, temporary tables could be used here to help Hive generate the 5 map-reduce jobs, but the join plan is not the hot spot I care about.
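For example, something along these lines is what I had in mind. This is only a rough sketch: tmp_1 .. tmp_4 are table names I'm making up for illustration, and I'm assuming the "distinct statistics_1" in my original query really means count(distinct statistics_1):

create table tmp_1(type string, count1 int);
insert overwrite table tmp_1
select type, count(distinct statistics_1)
from tmp
group by type;

-- likewise create and fill tmp_2, tmp_3 and tmp_4 from statistics_2 .. statistics_4

insert overwrite table result_tbl
select t1.type, t1.count1, t2.count2, t3.count3, t4.count4
from tmp_1 t1
join tmp_2 t2 on t1.type = t2.type
join tmp_3 t3 on t1.type = t3.type
join tmp_4 t4 on t1.type = t4.type;

If I understand you correctly, that would be 4 map-reduce jobs for the group-bys plus 1 for the join, i.e. the 5 jobs you mentioned.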
The first query, with its many unions, leads to a map-only job. If it executes the way you explained, scanning the table only once, why does it take more than 4 minutes, while my whole statistics run using raw MapReduce needs only 2 minutes, and its mappers took less than 1 minute?

On Tue, May 26, 2009 at 1:42 PM, Namit Jain <[email protected]> wrote:
> Can you do an explain on the queries? The first query looks OK – 1 map-reduce
> job — the table should be scanned only once and not N times.
>
> The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for JOINs.
>
> The way you have written the join query is wrong – it will lead to
> cartesian products of t1 and t2, then of the result of t1 and t2 with t3,
> and then the join.
>
> Rewrite it as:
>
> From t1 join t2 on t1.type = t2.type
> Join t3 on t1.type = t3.type
> Join t4 on t1.type = t4.type
>
> It will lead to 5 map-reduce jobs.
>
> Predicate push down should fix that – can you file a jira for that? This
> join should be optimized with predicate pushdown.
>
> On 5/25/09 7:43 PM, "Min Zhou" <[email protected]> wrote:
>
> Hi all,
>
> We had a mapreduce job scanning records and generating perhaps more than 10
> <key, value> pairs per record. Here is the mappers' pseudo-code:
>
> map(key, value, output, reporter) {
>   if (key.contains(a)) {
>     output.collect('class_a', value)
>   }
>   if (key.contains(b)) {
>     output.collect('class_b', value)
>   }
>   ...
>   if (key.contains(z)) {
>     output.collect('class_z', value)
>   }
> }
>
> Reducers did some statistics on the output of the mappers. The whole
> mapreduce job took us only 2 minutes. But we need more than 10 minutes
> using a union all query in Hive like the one below:
>
> create table tmp(type string, other_cols' definition)
>
> insert overwrite table tmp
> select * from (
>   select
>     'class_a',
>     other_cols
>   from
>     tbl
>   where key.contains(a)
>
>   union all
>
>   select
>     'class_b',
>     other_cols
>   from
>     tbl
>   where key.contains(b)
>
>   ...
>
>   union all
>
>   select
>     'class_z',
>     other_cols
>   from
>     tbl
>   where key.contains(z)
> ) t;
>
> create table result_tbl(type string, count1 int, count2 int, count3 int,
> count4 int);
>
> insert overwrite table result_tbl
> select
>   t1.type,
>   t1.count1,
>   t2.count2,
>   t3.count3,
>   t4.count4
> from
>   (select
>     type,
>     distinct statistics_1 as count1
>   from
>     tmp
>   group by
>     type) t1
>
> join
>   (select
>     type,
>     distinct statistics_2 as count2
>   from
>     tmp
>   group by
>     type) t2
>
> join
>   (select
>     type,
>     distinct statistics_3 as count3
>   from
>     tmp
>   group by
>     type) t3
>
> join
>   (select
>     type,
>     distinct statistics_4 as count4
>   from
>     tmp
>   group by
>     type) t4
>
> on t1.type = t2.type
> and t1.type = t3.type
> and t1.type = t4.type
>
> The first query, including many unions, needed 1 mapreduce job, but seemed to
> scan the table N times, where N is the number of unioned select statements.
> The second query needs 7 mapreduce jobs; obviously, more time is needed.
>
> Is there anything wrong with how we are using Hive? Do you have a solution
> for this issue? Thanks in advance!
>
> Yours,
> Min

Regards,
Min

--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
