Re: Putting the big table rightmost in the join

Edward Capriolo Fri, 19 Feb 2010 11:36:26 -0800

On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <[email protected]> wrote:
> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
> <[email protected]> wrote:
>> Hi Edward,
>> You can do it with the STREAMTABLE hint. Hive will treat the table named in
>> that hint as the rightmost one.
>>
>> -yongqiang
>> On 2/18/10 3:21 PM, "Edward Capriolo" <[email protected]> wrote:
>>
>>> I have worked through this issue.
>>>
>>> * When doing a join, please put the table with a large number of rows
>>> containing the same join key rightmost in the JOIN clause. Otherwise we
>>> may see OutOfMemory errors.
>>>
>>> This advice does work, but should we open up a jira to create a simple
>>> optimizer that does this?
>>>
>>> Edward
>>>
>>>
>>
>>
>>
>
> I do not understand the hint. A user can rewrite the query, can't they?
>
> select a join b
> select b join a
>
> What I am asking is: should we add an optimizer that applies heuristics
> to the tables and automatically streams the smaller/larger?
>


The reason I am mentioning this is that I am training Hive users right now.
You can imagine that the first three-table join someone did caused an
OOM. I explained to them roughly how a Hive join works and how you
should move the largest table to one side. They understood but
replied, "Sounds like something an optimizer could handle."

Even when joining two tables, it is a pain to ask someone to find out which
table is larger. Imagine joining 10 or so. There is also user perception:
imagine your first join throwing an OOM.
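
To make the idea concrete, here is a minimal sketch of the kind of heuristic
I mean, written in Python rather than inside Hive. It assumes we already have
an estimated row count per table (the counts below are made up), and it uses
total row count as a rough stand-in for "rows per join key". The function
names order_for_join and streamtable_hint are hypothetical, not anything Hive
provides; only the STREAMTABLE hint syntax is Hive's own.

```python
def order_for_join(row_counts):
    """Return table names sorted by row count, so the largest comes last
    (that is, rightmost in the JOIN clause)."""
    return sorted(row_counts, key=lambda name: row_counts[name])


def streamtable_hint(row_counts):
    """Build a STREAMTABLE hint that names the largest table."""
    largest = max(row_counts, key=lambda name: row_counts[name])
    return "/*+ STREAMTABLE(%s) */" % largest


# Hypothetical row counts, for illustration only.
counts = {"clicks": 500000000, "users": 2000000, "countries": 250}

print(order_for_join(counts))    # ['countries', 'users', 'clicks']
print(streamtable_hint(counts))  # /*+ STREAMTABLE(clicks) */
```

A real optimizer would need table statistics from somewhere and would have to
account for key skew, but even this much would save a new user from the OOM.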
