Hi Amr,

Thanks for your help. I will try the STREAMTABLE option if one of the datasets exceeds 1 GB.

Viraj

________________________________
From: Amr Awadallah [mailto:[email protected]]
Sent: Saturday, June 26, 2010 12:58 AM
To: [email protected]
Subject: Re: MapSide join in Hive

Viraj,

1. No.
2. Yes, the smaller table needs to fit in JVM memory (typically, more than 1 GB is too large for the small table).

See slide 7 and after in this preso for different join strategies that can help in case the tables are bucketed and sorted:
http://www.slideshare.net/zshao/hive-user-meeting-march-2010-hive-team

There is also the /*+STREAMTABLE(tablealias)*/ hint, which you should use for very large tables (or make sure the large table is the rightmost table in the join clause).

-- amr

On 6/24/2010 10:43 AM, Viraj Bhat wrote:

Hi all,

I am joining 2 datasets; one is around 1.5 TB in size and the other is around 350 MB. I wanted to do a map-side join between the two tables, using "id" as the join column.

I read about the map-side join in Hive:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins

Are there some technical specs on the map-side join on a wiki/jira?

Here are some questions:

1. Do the tables need to be sorted on "id"?
2. Is there a restriction on the smaller table size?
3. Are there other join optimizations that Hive provides which I can apply here?

Viraj
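
For illustration, the two approaches discussed in this thread might look roughly like the following in HiveQL. This is only a sketch under stated assumptions: the table names (big_events, small_dim) and the columns payload and label are hypothetical stand-ins for the 1.5 TB and 350 MB datasets, and the MAPJOIN hint comes from the Hive Joins wiki page linked in the original question rather than from the messages themselves.

```sql
-- Hypothetical tables: big_events (~1.5 TB) and small_dim (~350 MB), joined on id.

-- Map-side join: the MAPJOIN hint names the table that is loaded into memory
-- on each mapper, so small_dim has to fit in the JVM heap.
SELECT /*+ MAPJOIN(s) */ b.id, b.payload, s.label
FROM big_events b
JOIN small_dim s ON (b.id = s.id);

-- Regular (reduce-side) join: the STREAMTABLE hint names the table that is
-- streamed through the reducers instead of being buffered.
SELECT /*+ STREAMTABLE(b) */ b.id, b.payload, s.label
FROM big_events b
JOIN small_dim s ON (b.id = s.id);
```

Without either hint, the rightmost table in the join clause is the one that gets streamed, which is why putting the largest table last has the same effect as the second query.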
