Edward,

Multi-table joins are practical, and we do a lot of them here. If the OOM exception was thrown from a regular reduce-side join, it may be caused by skew in your join keys.
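The skew failure mode described in this thread can be sketched in a few lines. This is not Hive's actual code, just a minimal model of a reduce-side join: for each distinct join key, rows from every table except the rightmost (streamed) one are buffered in memory, so a heavily skewed key inflates the buffer. The function name and the `peak_buffered` counter are illustrative, not Hive APIs.

```python
from itertools import groupby, product
from operator import itemgetter

def reduce_side_join(rows, stream_table):
    """Model of a reduce-side join.

    rows: (key, table_id, value) tuples, sorted by key.
    All tables except `stream_table` are buffered in memory per key;
    the streamed table's rows are joined one at a time.
    Returns the joined rows and the peak number of buffered rows,
    which is what blows up under key skew.
    """
    out = []
    peak_buffered = 0
    for key, group in groupby(rows, key=itemgetter(0)):
        buffers = {}   # table_id -> rows held in RAM for this key
        streamed = []
        for _, tid, val in group:
            if tid == stream_table:
                streamed.append(val)
            else:
                buffers.setdefault(tid, []).append(val)
        peak_buffered = max(peak_buffered,
                            sum(len(v) for v in buffers.values()))
        # every combination of buffered rows joins with each streamed row
        for combo in product(*buffers.values()):
            for s in streamed:
                out.append((key,) + combo + (s,))
    return out, peak_buffered
```

With few rows per key the buffer stays small regardless of total table size; one skewed key forces all of its rows into memory at once, which is the OOM scenario.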
From branch 0.5 onward, there is a parameter hive.join.cache.size to control how many rows are put into the cache when doing joins. Before branch 0.5, a regular reduce-side join works like this: for each join key, we read all rows from the small tables (every table except the rightmost) into main memory and join them with the rightmost table (a.k.a. the streamed table). Note that this happens per distinct join key. If the small tables have many rows overall but only a small number of rows per join key, memory consumption is not that large. But if your join keys are skewed, you may see OOM exceptions, particularly when you join multiple tables in the same join operator.

For map-side join before branch 0.5, we read all rows from the small table specified by the MAPJOIN hint and build a hash map on the join key. From branch 0.5, you can also set the parameter hive.mapjoin.max.numrows to specify how many rows to read into the cache for each map-joined table.

Thanks,
Ning

@Ning
> Map-join works with a persistent data structure starting from branch 0.5. We keep a cache
> using a HashMap in main memory, and you can specify the number of rows to put in the cache.
> If it is too large, data is spilled to a disk-based hash table.

So I have a problem: I am trying to join 5 large tables. My pageview table is partitioned by day (~6 GB), and I have 4 other tables that can be about 1 GB each. Currently, joining the 5 tables throws OOM. I see only two ways I can solve this:

1) break my joins up into stages, with each stage written to a Hive table

@Ning, or anyone else: with the enhancements to map-join in 0.5, will my 5-table join work (theoretically) if I use MAPJOIN? Speed is not really an issue here, but if joining 2+ tables is not practical in most cases, I need to start planning for this now.

Thanks
