Edward,

Multi-table joins are practical and we do a lot of these here.  If the OOM 
exception was thrown from a regular reduce-side join, it may be caused by 
skewness in your join keys.

>From branch-0.5 and forward, you will have a parameter hive.join.cache.size to 
>control how many rows you want to put into cache when doing joins. Before 
>branch-0.5, what we do for regular reduce-side join is that, for each join 
>key, we will read all rows from the small tables (the ones except the 
>rightmost table) in main memory and join with the rightmost table (a.k.a the 
>streming table). Note this is the for each distinct join key. If you have a 
>lot of rows in the small tables but each one has small # of rows for each join 
>key, the memory consumption is not that large. But if you have skewness in 
>your join key, you may see OOM exceptions, particularly when you joining 
>multiple tables in the same join operator.

For map-side join pre branch 0.5, we read all rows from the small table 
specified by the mapjoin hint, and build a hashmap on the join key. From branch 
0.5, you can also specify a parameter hive.mapjoin.max.numrows to specify how 
many rows you want to read into cache for each map-joined table.

Thanks,
Ning


@Ning

Map-join works with persistent data structure staring from branch 0.5. We will 
keep a cache >>using HashMap in main memory and you can specify the # of rows 
to be put in the cache. If it >>is too large, data will be spilled to 
disk-based hash table.

So I have a problem, I am trying to join 5 large tables. My pageview
table is partitioned by day ~6GB but I have 4 other tables that can be
about 1GB each.

Currently joining the 5 tables throws OOM.

I see only two ways I can solve this.
1) break my joins up into stages, each stage is written to a hive table

@Ning. or anyone else

With the enhancements to mapjoin in 5.0, will my 5 table join work
(theoretically) if I use mapjoin? Speed is not really an issue here,
but if joining 2+ tables is not practical in most cases I need to
start planning for this now.

Thanks

Reply via email to