On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <[email protected]> wrote:
> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
> <[email protected]> wrote:
>> Hi Edward,
>> You can do it with streamtable hint. Hive will put the table in that hint in
>> the rightmost.
>>
>> -yongqiang
>> On 2/18/10 3:21 PM, "Edward Capriolo" <[email protected]> wrote:
>>
>>> I have worked through this issue.
>>>
>>> * When doing Join, please put the table with big number of rows
>>> containing the same join key to
>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>>
>>> This advice does work, but should we open up a jira to create a simple
>>> optimizer that does this?
>>>
>>> Edward
>>>
>>>
>>
>>
>>
>
> I do not understand the hint. A user can re-write the query can't they?
>
> select a join b
> select b join a
>
> What I am asking, should we add an optimizer that uses does heuristics
> on the tables and automatically streams the smaller/larger?
>

The reason I am mentioning this is that I am training Hive users right now. You can imagine that the first three-table join someone did caused an OOM. I explained to them roughly how a Hive join works and why you should move the largest table to the rightmost position in the JOIN clause. They understood but replied, "Sounds like something an optimizer could handle." Even when joining two tables, it is a pain to ask someone to find which table is larger. Imagine joining 10 or so. There is also user perception: imagine your first join throwing an OOM.
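
To make the idea concrete, here is a minimal sketch of the kind of heuristic I have in mind. It is plain Python, not Hive code. The function name, the size map, and the row counts are all made up for illustration, and it uses an estimated row count as a rough proxy for "big number of rows containing the same join key". It also ignores whether the join conditions still line up after reordering, which a real optimizer would have to check.

```python
def order_for_streaming(table_sizes):
    """Return table names sorted smallest to largest by estimated rows.

    table_sizes is a hypothetical mapping of table name -> estimated row
    count. The largest table ends up last, i.e. rightmost in the JOIN.
    """
    return sorted(table_sizes, key=lambda name: table_sizes[name])


# Made-up estimates for three tables in a join.
sizes = {"a": 5_000_000, "b": 1_200, "c": 80_000}
print(order_for_streaming(sizes))  # ['b', 'c', 'a']
```

With these numbers the suggested order is b, c, a, so the user would not need to work out which table is largest by hand.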
