If the dimension tables have only a few thousand rows, then a map-side join will be the best option.
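For your f / d1 / d2 schema, a map-join version of your query would look roughly like this (just a sketch using the MAPJOIN hint Namit shows in the quoted thread below; I have not run it):

select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
from f
join d1 on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

With d1 and d2 loaded into memory on the map side, the join runs in the mappers, so there is no reduce-side join to get stuck on your skewed join keys; the group by still runs as a normal reduce.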
What I was describing is not a real index push-down join; see http://portal.acm.org/citation.cfm?id=564691.564754 for more details on index push-down joins.

My previous description of the index push-down join is not a good fit for your situation because:
1) the dimension tables have only a few thousand rows;
2) your fact/dimension tables are not wide ("fat") at all;
3) there are no predicates on the dimension tables.

Here is an example to illustrate my previous description:

create table f(f_key int, d1_key int, d2_key int, measure int);
create table d1(d1_key int, d1_value string);
create table d2(d2_key int, d2_value string);

Then create several "view" tables (these tables can be shared among many related queries):

create table f_d1(f_key int, d1_key int, d1_attr...);
create table f_d2(f_key int, d2_key int, d2_attr...);

First join f_d1 with f_d2, then join the result with the big, fat fact table, so the big table comes last and is scanned only once (a rough sketch of the query is at the bottom of this message, after the quoted thread). This assumes f_d1 and f_d2 are much smaller than the fact and dimension tables, which does not hold in your situation; there they become much bigger :( .

On 09-7-14 11:11 PM, "David Lerman" <[email protected]> wrote:

> Thanks. Namit, map-join looks promising. He, I'm not quite sure I'm
> following yet.
>
> In our (simplified) case, we have a fact table with many millions of rows
> (call it f), which joins to two dimension tables (call them d1 and d2),
> each of which has at most a few thousand rows.
>
> So:
> create table f(f_key int, d1_key int, d2_key int, measure int);
> create table d1(d1_key int, d1_value string);
> create table d2(d2_key int, d2_value string);
>
> The facts are split among the dimensions extremely disproportionately, so
> although d2 has 1000 rows, 90% of the facts are linked to the top 10.
>
> The query is:
> select d1.d1_value, d2.d2_value, sum(measure) from d1 join f on d1.d1_key =
> f.d1_key join d2 on d2.d2_key = f.d2_key group by d1.d1_value, d2.d2_value;
>
> We're finding that although we have 100 reducers, 98 of them execute very
> quickly while 2 get stuck with the vast majority of the reduce records and
> crash with out-of-memory exceptions even though they have multiple gigs of
> memory allocated.
>
> This raises two questions:
>
> 1. With this kind of join (small x big x small), what's the best way to
> keep the memory usage down? Map-join definitely looks promising, and He,
> I'd like to understand the index push-down join approach so we can try it
> out as well.
>
> 2. I'm assuming we're getting those 2 reducers stuck with the whole load
> because the reduce records get partitioned by the join key and the vast
> majority of records use just a few keys. Is there anything we can do about
> this?
>
> Thanks so much!
> Dave
>
> On 7/14/09 8:51 AM, "He Yongqiang" <[email protected]> wrote:
>
>> A map-side join is the most efficient, but as Namit mentioned, it also
>> has some limitations.
>>
>>>> If your dimension tables are small, you can use map-join.
>> You can also use a temporary table for the map-join if a dimension table
>> is not small but predicates on that table dramatically reduce the number
>> of rows.
>>
>> Another optimization you can experiment with is the so-called “index push
>> down join”. Since Hive does not support indexes (and of course no
>> bitmap/bit-vector indexes) right now, you can mimic the technique
>> yourself. The key is to pre-compute several join tables, and to find a
>> way to first join the small tables and then join with the fact table.
>>
>> create table dimension1_fact(d1_key, fact_key, attributes of dimension
>> table d1 which are mostly used in queries);
>> create table dimension2_fact(d2_key, fact_key, attributes of dimension
>> table d2 which are mostly used in queries);
>> etc ...
>>
>> Then you can translate your queries to use these manually created tables.
>> The query looks like:
>>
>> Select .... from big_table join (select fact_key, ... from dimension1_fact
>> join dimension2_fact on dimension1_fact.fact_key = dimension2_fact.fact_key
>> where .....) join_dimension_table on
>> big_table.key = join_dimension_table.fact_key
>>
>> I'm not sure this will reduce the execution time, and it may even increase
>> it (you can do some experiments :) ).
>>
>>
>> On 09-7-14 1:45 PM, "Namit Jain" <[email protected]> wrote:
>>
>>> The large table is only a problem if the number of values for a given
>>> key is very large – they are stored in memory.
>>>
>>> If your dimension tables are small, you can use map-join. That way, no
>>> reduce phase is needed. You need to specify the hint in the select
>>> clause and the list of tables which are small.
>>>
>>> For example:
>>>
>>> select /*+ MAPJOIN(small_table1, small_table2) */ ... from small_table1
>>> join big_table on ... join small_table2 on ...
>>>
>>> Thanks,
>>> -namit
>>>
>>> On 7/13/09 9:44 PM, "Jason Michael" <[email protected]> wrote:
>>>
>>>> In our Hive instance, we have one large fact-type table that joins to
>>>> several dimension tables on integer keys. I know from reading the
>>>> Language Manual that in ordering joins it is best to join the largest
>>>> table last in the sequence in order to minimize memory usage. This
>>>> won't work in the situation where you want to join the large fact
>>>> table to more than one dimension. Something like:
>>>>
>>>> select ... from small_table1 join big_table on ... join small_table2 on ...
>>>>
>>>> I have to imagine this is a pretty common pattern. Is there any guidance
>>>> for doing this sort of star schema join?
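P.S. To make the f_d1 / f_d2 idea at the top concrete, the rewritten query would look roughly like this (the d1_attr / d2_attr columns and the aggregation are only placeholders taken from the example above; untested):

select f_d1.d1_attr, f_d2.d2_attr, sum(f.measure)
from f_d1
join f_d2 on f_d1.f_key = f_d2.f_key
join f on f.f_key = f_d1.f_key
group by f_d1.d1_attr, f_d2.d2_attr;

Here the fact table f is last in the join order, so it is streamed and scanned only once. But as I said, in your case f_d1 and f_d2 would each have as many rows as f itself, so this is unlikely to help you.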
