On Fri, Feb 19, 2010 at 3:40 PM, Yongqiang He <[email protected]> wrote:
> Yes. I agree.
> Moving work should be done by developers to users is always not
> user-friendly. :)
> For star-schemes, people can easily find out which table is the small table.
> Joining with more than one big table is always risky.
>
> But before a cost optimizer, it seems we have no other choice.
>
> Thanks
> Yongqiang
> On 2/19/10 11:35 AM, "Edward Capriolo" <[email protected]> wrote:
>
>> On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <[email protected]>
>> wrote:
>>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>>> <[email protected]> wrote:
>>>> Hi Edward,
>>>> You can do it with streamtable hint. Hive will put the table in that
>>>> hint in the rightmost.
>>>>
>>>> -yongqiang
>>>> On 2/18/10 3:21 PM, "Edward Capriolo" <[email protected]> wrote:
>>>>
>>>>> I have worked through this issue.
>>>>>
>>>>> * When doing Join, please put the table with big number of rows
>>>>> containing the same join key to the rightmost in the JOIN clause.
>>>>> Otherwise we may see OutOfMemory errors.
>>>>>
>>>>> This advice does work, but should we open up a jira to create a simple
>>>>> optimizer that does this?
>>>>>
>>>>> Edward
>>>>>
>>>
>>> I do not understand the hint. A user can re-write the query can't they?
>>>
>>> select a join b
>>> select b join a
>>>
>>> What I am asking, should we add an optimizer that uses does heuristics
>>> on the tables and automatically streams the smaller/larger?
>>>
>>
>> The reason I am mentioning this is I am training hive users right now.
>> You can imagine that the first three table join someone did caused an
>> OOM. I explained to them roughly how a hive join works and how you
>> should move the largest table to one side. They understood but
>> replied, "Sounds like something an optimizer could handle."
>>
>> Even joining two tables it is a pain to ask someone to find which
>> table is larger. Imagine joining 10 or so. Also user perception, image
>> your first join throwing in OOM.
>>
>
Understood, but do you believe sampling tables, as I suggested above, could work as an effective cost optimizer? The other option, computing stats, is a problem because Hive does not always produce the data it queries.
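
To make the sampling idea a bit more concrete, here is a rough sketch in plain Python. It is not Hive code and not a proposal for the actual implementation. The function names, the in-memory lists standing in for table scans, and the 10% default sampling fraction are all assumptions of mine. It samples the join keys of each table, scales the heaviest key count up by the sampling fraction, and orders the tables so that the one with the most rows per join key comes last.

```python
import random
from collections import Counter


def sample_keys(keys, fraction, seed=0):
    """Bernoulli-sample join keys; `keys` is a hypothetical stand-in for a table scan."""
    rng = random.Random(seed)
    return [k for k in keys if rng.random() < fraction]


def estimated_max_rows_per_key(sampled_keys, fraction):
    """Scale the heaviest sampled key count up to an estimate for the full table."""
    if not sampled_keys:
        return 0.0
    return max(Counter(sampled_keys).values()) / fraction


def order_for_join(tables, fraction=0.1, seed=0):
    """Return table names sorted so the table with the most rows per key is last.

    `tables` maps a table name to an iterable of its join-key values
    (an assumption made for illustration only).
    """
    estimates = {
        name: estimated_max_rows_per_key(sample_keys(keys, fraction, seed), fraction)
        for name, keys in tables.items()
    }
    return sorted(estimates, key=estimates.get)


if __name__ == "__main__":
    tables = {
        "dim": [1, 2, 3],
        "fact": [1] * 100 + [2] * 50,
        "mid": [1, 1, 2],
    }
    # fraction=1.0 reads every key, so the result is exact for this toy data.
    print(order_for_join(tables, fraction=1.0))
```

The last name in the returned list is the table that would go rightmost in the JOIN clause (or into the streamtable hint). With a small sampling fraction the estimate gets noisy for tables that have only a few rows per key, so this is only a heuristic.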
