Yes, I agree. Pushing work that developers should do onto users is never user-friendly. :) For star schemas, people can easily find out which table is the small one. Joining more than one big table is always risky.
But until we have a cost-based optimizer, it seems we have no other choice.

Thanks
Yongqiang

On 2/19/10 11:35 AM, "Edward Capriolo" <[email protected]> wrote:

> On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <[email protected]>
> wrote:
>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>> <[email protected]> wrote:
>>> Hi Edward,
>>> You can do it with the streamtable hint. Hive will treat the table named
>>> in that hint as the rightmost one.
>>>
>>> -yongqiang
>>> On 2/18/10 3:21 PM, "Edward Capriolo" <[email protected]> wrote:
>>>
>>>> I have worked through this issue.
>>>>
>>>> * When doing a join, please put the table with the largest number of
>>>> rows containing the same join key rightmost in the JOIN clause.
>>>> Otherwise we may see OutOfMemory errors.
>>>>
>>>> This advice does work, but should we open up a jira to create a simple
>>>> optimizer that does this?
>>>>
>>>> Edward
>>>>
>>
>> I do not understand the hint. A user can rewrite the query, can't they?
>>
>> select a join b
>> select b join a
>>
>> What I am asking is, should we add an optimizer that uses heuristics
>> on the tables and automatically streams the smaller/larger?
>>
>
> The reason I am mentioning this is I am training hive users right now.
> You can imagine that the first three-table join someone did caused an
> OOM. I explained to them roughly how a hive join works and how you
> should move the largest table to one side. They understood but
> replied, "Sounds like something an optimizer could handle."
>
> Even when joining two tables it is a pain to ask someone to find which
> table is larger. Imagine joining 10 or so. Also consider user perception:
> imagine your first join throwing an OOM.
>
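For illustration only, the heuristic discussed in this thread could be sketched in Python roughly as below. This is not Hive code. The function names and the `row_counts` input are hypothetical, and total row count is used only as a rough stand-in for "rows sharing the same join key". It assumes table sizes are available from somewhere, and that reordering is safe (plain inner joins only). The only Hive syntax it emits is the `/*+ STREAMTABLE(t) */` hint mentioned above.

```python
def order_for_streaming(row_counts):
    """Return table names sorted by size, largest last.

    row_counts: hypothetical dict mapping table name -> estimated row count.
    The last table in the returned list is the one that would sit rightmost
    in the JOIN clause and be streamed rather than buffered.
    """
    return sorted(row_counts, key=lambda name: row_counts[name])


def streamtable_hint(row_counts):
    """Build a STREAMTABLE hint naming the largest table.

    This avoids rewriting the join order: the query text stays as the user
    wrote it and only the hint is added after SELECT.
    """
    largest = max(row_counts, key=row_counts.get)
    return "/*+ STREAMTABLE(%s) */" % largest


if __name__ == "__main__":
    # Made-up sizes for three tables named a, b, c.
    sizes = {"a": 10, "b": 1000, "c": 5}
    print(order_for_streaming(sizes))
    print("SELECT " + streamtable_hint(sizes) + " ...")
```

Emitting the hint rather than reordering the tables keeps the user's query intact, which matters once outer joins are involved and the order can no longer be swapped freely.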
