You are doing two joins, which are not getting merged into a single join because the join keys are different.
1. Map join is the best way to get around this: both dimension tables will be held in memory. This should work for you.

2. In order to make sure that the fact table is the outermost table, break up the query:

select d1.d1_value, d2.d2_value, sum(measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

Change it to:

insert overwrite table tmp1
select * from d1 join f on d1.d1_key = f.d1_key;

select tmp1.d1_value, d2.d2_value, sum(measure)
from d2 join tmp1 on d2.d2_key = tmp1.d2_key
group by tmp1.d1_value, d2.d2_value;

-namit

On 7/14/09 8:11 AM, "David Lerman" <[email protected]> wrote:

Thanks. Namit, map-join looks promising. He, I'm not quite sure I'm following yet.

In our (simplified) case, we have a fact table with many millions of rows (call it f) which joins to two dimension tables (call them d1 and d2), each of which has at most a few thousand rows. So:

create table f (f_key int, d1_key int, d2_key int, measure int);
create table d1 (d1_key int, d1_value string);
create table d2 (d2_key int, d2_value string);

The facts are split among the dimensions extremely disproportionately: although d2 has 1000 rows, 90% of the facts are linked to the top 10. The query is:

select d1.d1_value, d2.d2_value, sum(measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

We're finding that though we have 100 reducers, 98 of them execute very quickly while 2 get stuck with the vast majority of the reduce records and crash with out-of-memory exceptions, even though they have multiple gigs of memory allocated. This raises two questions:

1. With this kind of join (small x big x small), what's the best way to keep memory usage down? Map-join definitely looks promising, and He, I'd like to understand the index push down join approach so we can try it out as well.

2. I'm assuming those 2 reducers get stuck with the whole load because the reduce records are partitioned by the join key and the vast majority of records use just a few keys. Is there anything we can do about this?

Thanks so much!
Dave

On 7/14/09 8:51 AM, "He Yongqiang" <[email protected]> wrote:

> Map-side join is the most efficient, but as Namit mentioned, it also has
> some limitations.
>
>>> If your dimension tables are small, you can use map-join.
> You can also use a temporary table for map-join if a dimension table is not
> small but predicates on that table can dramatically reduce the number of
> rows.
>
> Another optimization you can experiment with is the so-called “index push
> down join”. Since Hive does not support indexes (and of course no
> bitmap/bitvector indexes) right now, you can mimic the technique yourself.
> The key is to pre-compute several join tables, and to find a way to first
> join the small tables and only then join with the fact table:
>
> create table dimension1_fact (d1_key, fact_key, attributes of dimension
> table d1 which are mostly used in queries);
> create table dimension2_fact (d2_key, fact_key, attributes of dimension
> table d2 which are mostly used in queries);
> etc ...
>
> Then you can translate your queries to use these manually created tables.
> The query looks like:
>
> select ... from big_table join
>   (select factkey, ... from dimension1_fact join dimension2_fact
>    on dimension1_fact.fact_key = dimension2_fact.fact_key
>    where ...) join_dimension_table
> on big_table.key = join_dimension_table.factkey
>
> I can't be sure this will reduce the execution time; it may even increase
> it (you can do some experiments :) ).
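To make He Yongqiang's sketch concrete against the d1/d2/f schema from
David's message above, the pre-joined tables and the rewritten query could
look roughly like this (an illustrative sketch only: the names d1_fact,
d2_fact, and jd, and the choice of carried-along columns, are made up here,
not taken from the thread):

-- bridge tables: each pairs a fact key with the dimension attributes
-- the queries actually need
create table d1_fact (d1_key int, f_key int, d1_value string);
create table d2_fact (d2_key int, f_key int, d2_value string);

-- pre-execute the dimension-to-fact joins once
insert overwrite table d1_fact
select d1.d1_key, f.f_key, d1.d1_value
from d1 join f on d1.d1_key = f.d1_key;

insert overwrite table d2_fact
select d2.d2_key, f.f_key, d2.d2_value
from d2 join f on d2.d2_key = f.d2_key;

-- join the bridge tables to each other first, then join the result
-- back to the fact table on the single fact key
select jd.d1_value, jd.d2_value, sum(f.measure)
from f join
  (select d1_fact.f_key as f_key, d1_fact.d1_value as d1_value,
          d2_fact.d2_value as d2_value
   from d1_fact join d2_fact on d1_fact.f_key = d2_fact.f_key) jd
on f.f_key = jd.f_key
group by jd.d1_value, jd.d2_value;

This leaves only one join against the fact table, on a single key, in line
with Namit's point at the top of the thread. Note that without selective
predicates in the inner select, the bridge tables have as many rows as f
itself, which is presumably why the suggestion ends with the caveat to
experiment.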
>
>
> On 09-7-14 1:45 PM, "Namit Jain" <[email protected]> wrote:
>
>> The large table is only a problem if the number of values for a given key
>> is very large – those values are stored in memory.
>>
>> If your dimension tables are small, you can use map-join. That way, no
>> reduce step is needed. You need to specify the hint in the select clause,
>> listing the tables which are small.
>>
>> For example:
>>
>> select /*+ MAPJOIN(small_table1, small_table2) */ ... from small_table1
>> join big_table on ... join small_table2 on ...
>>
>> Thanks,
>> -namit
>>
>> On 7/13/09 9:44 PM, "Jason Michael" <[email protected]> wrote:
>>
>>> In our Hive instance, we have one large fact-type table that joins to
>>> several dimension tables on integer keys. I know from reading the
>>> Language Manual that when ordering joins it is best to join the largest
>>> table last in the sequence in order to minimize memory usage. This won’t
>>> work in the situation where you want to join the large fact table to
>>> more than one dimension. Something like:
>>>
>>> select ... from small_table1 join big_table on ... join small_table2 on ...
>>>
>>> I have to imagine this is a pretty common pattern; is there any guidance
>>> for doing this sort of star schema join?
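Tying the thread together: applying the MAPJOIN hint from Namit's example
to the concrete star query in David's message gives something like the
following (a sketch; it assumes d1 and d2 are each small enough to be
loaded into every mapper's memory, which is the precondition for map-join
discussed above):

-- d1 and d2 are replicated to each mapper, so both joins run map-side
-- and only the group-by needs a reduce phase
select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
from d1 join f on d1.d1_key = f.d1_key
join d2 on d2.d2_key = f.d2_key
group by d1.d1_value, d2.d2_value;

Because no reduce records are partitioned by the join keys, this also
sidesteps the key skew that left two of David's hundred reducers holding
most of the records.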
