Thanks.  Namit, map-join looks promising.  He, I'm not quite sure I'm
following yet.

In our (simplified) case, we have a fact table with many millions of rows
(call it f), which joins to two dimension tables (call them d1 and d2), each
of which has at most a few thousand rows.

So:
create table f(f_key int, d1_key int, d2_key int, measure int);
create table d1(d1_key int, d1_value string);
create table d2(d2_key int, d2_value string);

The facts are distributed among the dimension keys extremely disproportionately:
although d2 has 1000 rows, 90% of the facts are linked to its top 10 keys.

The query is:
select d1.d1_value, d2.d2_value, sum(measure)
from d1
join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

We're finding that though we have 100 reducers, 98 of them are executing
very quickly while 2 get stuck with the vast majority of the reduce records
and crash with out-of-memory exceptions even though they have multiple gigs
of memory allocated.

This raises two questions:

1.  With this kind of join (small x big x small), what's the best way to
keep the memory usage down?  Map-join definitely looks promising (I've
sketched what I think that would look like for our query below), and He,
I'd like to understand the index push-down join approach so we can try it
out as well.

2.  I'm assuming we're getting those 2 reducers stuck with the whole load
because the reduce records get partitioned by the join key and the vast
majority of records use just a few keys.  Is there anything we can do about
this?
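
For reference, here's what I think the map-join version of our query would
look like, going by Namit's example below (the hint syntax is my best guess,
so please correct me if I've mangled it):

select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
from f
join d1 on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

If I understand correctly, d1 and d2 would be loaded into each mapper's
memory, the join would happen entirely on the map side, and the reducers
would only have to handle the group-by.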

Thanks so much!
Dave

On 7/14/09 8:51 AM, "He Yongqiang" <[email protected]> wrote:

> map-side join is the most efficient, but as Namit mentioned, it also has some
> limitations.
> 
>>> If your dimension tables are small, you can use map-join.
> You can also use a temporary table for the map-join if a dimension table is
> not small by itself but predicates on that table can dramatically reduce the
> number of rows.
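> 
> For example, a sketch with made-up names (small_table1(key, value) and
> big_table(key, measure) are placeholders, and the filter predicate is
> hypothetical):
> 
> create table small1_filtered (key int, value string);
> insert overwrite table small1_filtered
> select key, value
> from small_table1
> where value = 'something-selective';  -- hypothetical selective predicate
> 
> select /*+ MAPJOIN(small1_filtered) */ small1_filtered.value, sum(big_table.measure)
> from small1_filtered
> join big_table on (big_table.key = small1_filtered.key)
> group by small1_filtered.value;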
> 
> Another optimization you can experiment with is the so-called “index push-down
> join”. Since Hive does not support indexes (and of course no bitmap/bit-vector
> indexes) right now, you can mimic the technique yourself.  The key is to
> pre-compute several join tables, and to find a way to join the small tables
> first and then join the result with the fact table.
> 
> create table dimension1_fact(d1_key, fact_key, attributes of dimension table
> d1 which are mostly used in queries );
> create table dimension2_fact(d2_key, fact_key, attributes of dimension table
> d2 which are mostly used in queries );
> etc ...
> And you can translate your queries to use these manually created tables.
> The query would look like:
> 
> select .... from big_table join (select fact_key, ... from dimension1_fact
> join dimension2_fact on dimension1_fact.fact_key = dimension2_fact.fact_key
> where .....) join_dimension_table
> on big_table.key = join_dimension_table.fact_key;
> 
> I cannot be sure this will reduce the execution time, and it may even increase
> it (you can do some experiments :) ).
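> 
> To make that concrete with placeholder names (d1_attr and d2_attr stand for
> whichever dimension attributes you copied into the helper tables, and the
> filter predicate is made up):
> 
> select big_table.measure, jd.d1_attr, jd.d2_attr
> from big_table
> join (select dimension1_fact.fact_key, dimension1_fact.d1_attr,
>              dimension2_fact.d2_attr
>       from dimension1_fact
>       join dimension2_fact
>         on (dimension1_fact.fact_key = dimension2_fact.fact_key)
>       where dimension1_fact.d1_attr = 'something-selective') jd
> on (big_table.fact_key = jd.fact_key);
> 
> The filter runs inside the subquery, so only the matching fact keys ever
> reach the join with the big table.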
> 
> 
> On 09-7-14 1:45 PM, "Namit Jain" <[email protected]> wrote:
> 
>> The large table is only a problem if the number of values for a given key is
>> very large – those values are buffered in memory.
>> 
>> If your dimension tables are small, you can use map-join. That way, no
>> reduce phase is needed. You need to specify the hint in the select clause,
>> along with the list of tables which are small.
>> 
>> 
>> For example:
>> 
>> 
>> select /*+ MAPJOIN(small_table1, small_table2) */ ... from small_table1
>> join big_table on ... join small_table2 on ...
>> 
>> 
>> Thanks,
>> -namit
>> 
>> 
>> On 7/13/09 9:44 PM, "Jason Michael" <[email protected]> wrote:
>> 
>>> In our hive instance, we have one large fact-type table that joins to
>>> several 
>>> dimension tables on integer keys.  I know from reading the Language Manual
>>> that in ordering joins it is best to join the largest table last in the
>>> sequence in order to minimize memory usage.  This won’t work in the
>>> situation 
>>> where you want to join the large fact table to more than one dimension.
>>> Something like:
>>> 
>>> select ... from small_table1 join big_table on ... join small_table2 on ...
>>> 
>>> I have to imagine this is a pretty common pattern; is there any guidance for
>>> doing this sort of star schema join?
>>> 
>>> 
>> 
