mapside join is the most efficient, but as Namit mentioned, it also has some
limitation.

>>If you dimension tables are small, you can use map-join.
Can use temporary table for map-join if a dimension table  is not small but
predicates on that table can dramatically reduce number of rows.

Another optimization you can experiment is so called “index push down join”.
Since hive does not support index ( and of course no bitmap/bitvector) right
now, you can mimic the techniques by yourself.  The key is to pre-execute
several join tables,  and to find some ways first join small tables and then
join with the fact table.

create table dimension1_fact(d1_key, fact_key, attributes of dimension table
d1 which are mostly used in queries );
create table dimension2_fact(d2_key, fact_key, attributes of dimension table
d2 which are mostly used in queries );
etc ...
And you can translate you queries to these manually created tables.
The query looks like:
Select .... from  big_table  join (select factkey,... From dimension1_fact
join dimension2_fact on d1_key=d2.key where .....) join_dimension_table on
big_table.key = join_dimension_table .factkey
Can not sure this can reduce the execution time, and may increase the
execution time (you can do some experiments :) ).


On 09-7-14 下午1:45, "Namit Jain" <[email protected]> wrote:

> The large table is only a problem if the number of values for a given key are
> very large – they are stored in memory.
> 
> If you dimension tables are small, you can use map-join. That way, no
> reduction is needed. You need to specify the hint in the select clause
> and the list of tables which are small.
> 
> 
> For eg:
> 
> 
> select /*+ MAPJOIN(smalll_table1, small_table2) */  ... from small_table1 join
> big_table on ... join small_table2 on ...
> 
> 
> Thanks,
> -namit
> 
> 
> On 7/13/09 9:44 PM, "Jason Michael" <[email protected]> wrote:
> 
>> In our hive instance, we have one large fact-type table that joins to several
>> dimension tables on integer keys.  I know from reading the Language Manual
>> that in ordering joins it is best to join the largest table last in the
>> sequence in order to minimize memory usage.  This won’t work in the situation
>> where you want to join the large fact table to more than one dimension.
>> Something like:
>> 
>> select ... from small_table1 join big_table on ... join small_table2 on ...
>> 
>> I have to imagine this is a pretty common pattern, is there any guidance for
>> doing this sort of star schema join?
>> 
>> 
> 

Reply via email to