If the dimension tables have only a few thousand rows, then a map-side join will be the best option.
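For your f / d1 / d2 schema, a map-join version of your query would look roughly like this (just a sketch using the MAPJOIN hint Namit shows in the quoted thread below; I have not run it):

select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
from f
join d1 on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

With d1 and d2 loaded into memory on the map side, the join runs in the mappers, so there is no reduce-side join to get stuck on your skewed join keys; the group by still runs as a normal reduce.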
What I was describing is not a real index push-down join; see http://portal.acm.org/citation.cfm?id=564691.564754 for more details on index push-down joins.

My previous description of the index push-down join is not a good fit for your situation because:
1) the dimension tables have only a few thousand rows;
2) your fact/dimension tables are not wide ("fat") at all;
3) there are no predicates on the dimension tables.

Here is an example to illustrate my previous description:

create table f(f_key int, d1_key int, d2_key int, measure int);
create table d1(d1_key int, d1_value string);
create table d2(d2_key int, d2_value string);

Then create several "view" tables (these tables can be shared among many related queries):

create table f_d1(f_key int, d1_key int, d1_attr...);
create table f_d2(f_key int, d2_key int, d2_attr...);

First join f_d1 with f_d2, then join the result with the big, fat fact table, so the big table comes last and is scanned only once (a rough sketch of the query is at the bottom of this message, after the quoted thread). This assumes f_d1 and f_d2 are much smaller than the fact and dimension tables, which does not hold in your situation; there they become much bigger :( .

On 09-7-14 11:11 PM, "David Lerman" <[email protected]> wrote:

> Thanks. Namit, map-join looks promising. He, I'm not quite sure I'm
> following yet.
>
> In our (simplified) case, we have a fact table with many millions of rows
> (call it f), which joins to two dimension tables (call them d1 and d2),
> each of which has at most a few thousand rows.
>
> So:
> create table f(f_key int, d1_key int, d2_key int, measure int);
> create table d1(d1_key int, d1_value string);
> create table d2(d2_key int, d2_value string);
>
> The facts are split among the dimensions extremely disproportionately, so
> although d2 has 1000 rows, 90% of the facts are linked to the top 10.
>
> The query is:
> select d1.d1_value, d2.d2_value, sum(measure) from d1 join f on d1.d1_key =
> f.d1_key join d2 on d2.d2_key = f.d2_key group by d1.d1_value, d2.d2_value;
>
> We're finding that although we have 100 reducers, 98 of them execute very
> quickly while 2 get stuck with the vast majority of the reduce records and
> crash with out-of-memory exceptions even though they have multiple gigs of
> memory allocated.
>
> This raises two questions:
>
> 1. With this kind of join (small x big x small), what's the best way to
> keep the memory usage down? Map-join definitely looks promising, and He,
> I'd like to understand the index push-down join approach so we can try it
> out as well.
>
> 2. I'm assuming we're getting those 2 reducers stuck with the whole load
> because the reduce records get partitioned by the join key and the vast
> majority of records use just a few keys. Is there anything we can do about
> this?
>
> Thanks so much!
> Dave
>
> On 7/14/09 8:51 AM, "He Yongqiang" <[email protected]> wrote:
>
>> A map-side join is the most efficient, but as Namit mentioned, it also
>> has some limitations.
>>
>>>> If your dimension tables are small, you can use map-join.
>> You can also use a temporary table for the map-join if a dimension table
>> is not small but predicates on that table dramatically reduce the number
>> of rows.
>>
>> Another optimization you can experiment with is the so-called “index push
>> down join”. Since Hive does not support indexes (and of course no
>> bitmap/bit-vector indexes) right now, you can mimic the technique
>> yourself. The key is to pre-compute several join tables, and to find a
>> way to first join the small tables and then join with the fact table.
>>
>> create table dimension1_fact(d1_key, fact_key, attributes of dimension
>> table d1 which are mostly used in queries);
>> create table dimension2_fact(d2_key, fact_key, attributes of dimension
>> table d2 which are mostly used in queries);
>> etc ...
>>
>> Then you can translate your queries to use these manually created tables.
>> The query looks like:
>>
>> Select .... from big_table join (select fact_key, ... from dimension1_fact
>> join dimension2_fact on dimension1_fact.fact_key = dimension2_fact.fact_key
>> where .....) join_dimension_table on
>> big_table.key = join_dimension_table.fact_key
>>
>> I'm not sure this will reduce the execution time, and it may even increase
>> it (you can do some experiments :) ).
>>
>>
>> On 09-7-14 1:45 PM, "Namit Jain" <[email protected]> wrote:
>>
>>> The large table is only a problem if the number of values for a given
>>> key is very large – they are stored in memory.
>>>
>>> If your dimension tables are small, you can use map-join. That way, no
>>> reduce phase is needed. You need to specify the hint in the select
>>> clause and the list of tables which are small.
>>>
>>> For example:
>>>
>>> select /*+ MAPJOIN(small_table1, small_table2) */ ... from small_table1
>>> join big_table on ... join small_table2 on ...
>>>
>>> Thanks,
>>> -namit
>>>
>>> On 7/13/09 9:44 PM, "Jason Michael" <[email protected]> wrote:
>>>
>>>> In our Hive instance, we have one large fact-type table that joins to
>>>> several dimension tables on integer keys. I know from reading the
>>>> Language Manual that in ordering joins it is best to join the largest
>>>> table last in the sequence in order to minimize memory usage. This
>>>> won't work in the situation where you want to join the large fact
>>>> table to more than one dimension. Something like:
>>>>
>>>> select ... from small_table1 join big_table on ... join small_table2 on ...
>>>>
>>>> I have to imagine this is a pretty common pattern. Is there any guidance
>>>> for doing this sort of star schema join?
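P.S. To make the f_d1 / f_d2 idea at the top concrete, the rewritten query would look roughly like this (the d1_attr / d2_attr columns and the aggregation are only placeholders taken from the example above; untested):

select f_d1.d1_attr, f_d2.d2_attr, sum(f.measure)
from f_d1
join f_d2 on f_d1.f_key = f_d2.f_key
join f on f.f_key = f_d1.f_key
group by f_d1.d1_attr, f_d2.d2_attr;

Here the fact table f is last in the join order, so it is streamed and scanned only once. But as I said, in your case f_d1 and f_d2 would each have as many rows as f itself, so this is unlikely to help you.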
