You are doing two joins, which are not getting merged into a single join because the join keys are different.
1. Map join is the best way to get around this: both dimension tables will be held in memory. This should work for you.

2. In order to make sure that the fact table is the outermost table, break up the query:

select d1.d1_value, d2.d2_value, sum(measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

Change it to:

insert overwrite table tmp1
select * from d1 join f on d1.d1_key = f.d1_key;

select tmp1.d1_value, d2.d2_value, sum(measure)
from d2 join tmp1 on d2.d2_key = tmp1.d2_key
group by tmp1.d1_value, d2.d2_value;

-namit

On 7/14/09 8:11 AM, "David Lerman" <[email protected]> wrote:

Thanks. Namit, map-join looks promising. He, I'm not quite sure I'm following yet.

In our (simplified) case, we have a fact table with many millions of rows (call it f) which joins to two dimension tables (call them d1 and d2), each of which has at most a few thousand rows. So:

create table f (f_key int, d1_key int, d2_key int, measure int);
create table d1 (d1_key int, d1_value string);
create table d2 (d2_key int, d2_value string);

The facts are split among the dimensions extremely disproportionately: although d2 has 1000 rows, 90% of the facts are linked to the top 10. The query is:

select d1.d1_value, d2.d2_value, sum(measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

We're finding that though we have 100 reducers, 98 of them execute very quickly while 2 get stuck with the vast majority of the reduce records and crash with out-of-memory exceptions, even though they have multiple gigs of memory allocated. This raises two questions:

1. With this kind of join (small x big x small), what's the best way to keep memory usage down? Map-join definitely looks promising, and He, I'd like to understand the index push down join approach so we can try it out as well.

2. I'm assuming those 2 reducers get stuck with the whole load because the reduce records are partitioned by the join key and the vast majority of records use just a few keys. Is there anything we can do about this?

Thanks so much!
Dave

On 7/14/09 8:51 AM, "He Yongqiang" <[email protected]> wrote:

> Map-side join is the most efficient, but as Namit mentioned, it also has
> some limitations.
>
>>> If your dimension tables are small, you can use map-join.
> You can also use a temporary table for map-join if a dimension table is not
> small but predicates on that table can dramatically reduce the number of
> rows.
>
> Another optimization you can experiment with is the so-called “index push
> down join”. Since Hive does not support indexes (and of course no
> bitmap/bitvector indexes) right now, you can mimic the technique yourself.
> The key is to pre-compute several join tables, and to find a way to first
> join the small tables and only then join with the fact table:
>
> create table dimension1_fact (d1_key, fact_key, attributes of dimension
> table d1 which are mostly used in queries);
> create table dimension2_fact (d2_key, fact_key, attributes of dimension
> table d2 which are mostly used in queries);
> etc ...
>
> Then you can translate your queries to use these manually created tables.
> The query looks like:
>
> select ... from big_table join
>   (select factkey, ... from dimension1_fact join dimension2_fact
>    on dimension1_fact.fact_key = dimension2_fact.fact_key
>    where ...) join_dimension_table
> on big_table.key = join_dimension_table.factkey
>
> I can't be sure this will reduce the execution time; it may even increase
> it (you can do some experiments :) ).
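To make He Yongqiang's sketch concrete against the d1/d2/f schema from
David's message above, the pre-joined tables and the rewritten query could
look roughly like this (an illustrative sketch only: the names d1_fact,
d2_fact, and jd, and the choice of carried-along columns, are made up here,
not taken from the thread):

-- bridge tables: each pairs a fact key with the dimension attributes
-- the queries actually need
create table d1_fact (d1_key int, f_key int, d1_value string);
create table d2_fact (d2_key int, f_key int, d2_value string);

-- pre-execute the dimension-to-fact joins once
insert overwrite table d1_fact
select d1.d1_key, f.f_key, d1.d1_value
from d1 join f on d1.d1_key = f.d1_key;

insert overwrite table d2_fact
select d2.d2_key, f.f_key, d2.d2_value
from d2 join f on d2.d2_key = f.d2_key;

-- join the bridge tables to each other first, then join the result
-- back to the fact table on the single fact key
select jd.d1_value, jd.d2_value, sum(f.measure)
from f join
  (select d1_fact.f_key as f_key, d1_fact.d1_value as d1_value,
          d2_fact.d2_value as d2_value
   from d1_fact join d2_fact on d1_fact.f_key = d2_fact.f_key) jd
on f.f_key = jd.f_key
group by jd.d1_value, jd.d2_value;

This leaves only one join against the fact table, on a single key, in line
with Namit's point at the top of the thread. Note that without selective
predicates in the inner select, the bridge tables have as many rows as f
itself, which is presumably why the suggestion ends with the caveat to
experiment.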
>
>
> On 09-7-14 1:45 PM, "Namit Jain" <[email protected]> wrote:
>
>> The large table is only a problem if the number of values for a given key
>> is very large – those values are stored in memory.
>>
>> If your dimension tables are small, you can use map-join. That way, no
>> reduce step is needed. You need to specify the hint in the select clause,
>> listing the tables which are small.
>>
>> For example:
>>
>> select /*+ MAPJOIN(small_table1, small_table2) */ ... from small_table1
>> join big_table on ... join small_table2 on ...
>>
>> Thanks,
>> -namit
>>
>> On 7/13/09 9:44 PM, "Jason Michael" <[email protected]> wrote:
>>
>>> In our Hive instance, we have one large fact-type table that joins to
>>> several dimension tables on integer keys. I know from reading the
>>> Language Manual that when ordering joins it is best to join the largest
>>> table last in the sequence in order to minimize memory usage. This won’t
>>> work in the situation where you want to join the large fact table to
>>> more than one dimension. Something like:
>>>
>>> select ... from small_table1 join big_table on ... join small_table2 on ...
>>>
>>> I have to imagine this is a pretty common pattern; is there any guidance
>>> for doing this sort of star schema join?
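Tying the thread together: applying the MAPJOIN hint from Namit's example
to the concrete star query in David's message gives something like the
following (a sketch; it assumes d1 and d2 are each small enough to be
loaded into every mapper's memory, which is the precondition for map-join
discussed above):

-- d1 and d2 are replicated to each mapper, so both joins run map-side
-- and only the group-by needs a reduce phase
select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

Because no reduce records are partitioned by the join keys, this also
sidesteps the key skew that left two of David's hundred reducers holding
most of the records.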
