Hi Ning,
There is something confusing me. Is the "regular reduce-side join" something 
like the Pig skewed-join approach (http://wiki.apache.org/pig/PigSkewedJoinSpec)? 
Or is it the default join, which reads all the tables, shuffles the records 
with the same key to the same reducer, and does the join there?

Thanks.
-Gang

________________________________
From: Ning Zhang <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Thursday, 2010/2/25, 1:19:16 PM
Subject: Re: map join and OOM


Edward, 

Multi-table joins are practical, and we do a lot of them here. If the OOM 
exception was thrown from a regular reduce-side join, it may be caused by 
skew in your join keys.

From branch-0.5 onward, there is a parameter hive.join.cache.size to 
control how many rows are cached when doing joins. Before branch-0.5, the 
regular reduce-side join works as follows: for each join key, we read all 
rows from the small tables (every table except the rightmost one) into main 
memory and join them with the rightmost table (a.k.a. the streaming table). 
Note that this is done for each distinct join key. If you have a lot of rows 
in the small tables but only a small number of rows per join key, memory 
consumption is not that large. But if your join key is skewed, you may see 
OOM exceptions, particularly when you are joining multiple tables in the 
same join operator.
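As a concrete illustration, here is a minimal sketch of capping the reduce-side join cache described above. The table names (users, pageview), column names, and the cache value are hypothetical, not from this thread:

```sql
-- Hypothetical sketch: cap the per-join-key row cache (branch-0.5+).
SET hive.join.cache.size=25000;

-- users is a "small" table buffered in memory per join key; pageview,
-- the rightmost table, is the streaming table in the reducer.
SELECT u.name, p.url
FROM users u JOIN pageview p ON (u.uid = p.uid);
```

If one key (say, a null or default uid) dominates, all of its rows from the small tables must fit in that cache at once, which is where the skew-related OOM comes from.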

For a map-side join before branch-0.5, we read all rows from the small table 
specified by the MAPJOIN hint and build a hash map on the join key. From branch 
0.5, you can also set a parameter hive.mapjoin.max.numrows to specify how 
many rows you want to read into the cache for each map-joined table.
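For comparison, a sketch of the map-side join with the MAPJOIN hint (again, the table names and the row limit are illustrative assumptions, not from this thread):

```sql
-- Hypothetical sketch: cap rows loaded into the in-memory hash map
-- for each map-joined table (branch-0.5+).
SET hive.mapjoin.max.numrows=1000000;

-- users is read into a hash map keyed on uid inside each mapper,
-- so the join happens map-side with no shuffle/reduce phase.
SELECT /*+ MAPJOIN(u) */ u.name, p.url
FROM users u JOIN pageview p ON (u.uid = p.uid);
```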

Thanks,
Ning



>@Ning
>
>>Map-join works with persistent data structures starting from branch 0.5. We 
>>will keep a cache using a HashMap in main memory, and you can specify the # 
>>of rows to be put in the cache. If it is too large, data will be spilled to 
>>a disk-based hash table.
>
>So I have a problem: I am trying to join 5 large tables. My pageview
>table is partitioned by day (~6GB), and I have 4 other tables that can be
>about 1GB each.
>
>Currently joining the 5 tables throws OOM.
>
>I see only two ways I can solve this.
>1) break my joins up into stages, each stage written to a Hive table
>
>@Ning. or anyone else
>
>With the enhancements to mapjoin in 0.5, will my 5-table join work
>(theoretically) if I use mapjoin? Speed is not really an issue here,
>but if joining 2+ tables is not practical in most cases, I need to
>start planning for this now.
>
>Thanks
>



