Re: Putting the big table rightmost in the join

Edward Capriolo Fri, 19 Feb 2010 12:48:39 -0800

On Fri, Feb 19, 2010 at 3:40 PM, Yongqiang He
<[email protected]> wrote:
> Yes. I agree.
> Moving work that should be done by developers onto users is never
> user-friendly. :)
> For star schemas, people can easily find out which table is the small table.
> Joining with more than one big table is always risky.
>
> But before a cost optimizer, it seems we have no other choice.
>
> Thanks
> Yongqiang
> On 2/19/10 11:35 AM, "Edward Capriolo" <[email protected]> wrote:
>
>> On Fri, Feb 19, 2010 at 2:25 PM, Edward Capriolo <[email protected]>
>> wrote:
>>> On Fri, Feb 19, 2010 at 12:35 AM, Yongqiang He
>>> <[email protected]> wrote:
>>>> Hi Edward,
>>>> You can do it with the streamtable hint. Hive will put the table in that
>>>> hint in the rightmost position.
>>>>
>>>> -yongqiang
>>>> On 2/18/10 3:21 PM, "Edward Capriolo" <[email protected]> wrote:
>>>>
>>>>> I have worked through this issue.
>>>>>
>>>>> * When doing Join, please put the table with big number of rows
>>>>> containing the same join key to
>>>>> the rightmost in the JOIN clause. Otherwise we may see OutOfMemory errors.
>>>>>
>>>>> This advice does work, but should we open up a jira to create a simple
>>>>> optimizer that does this?
>>>>>
>>>>> Edward
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> I do not understand the hint. A user can rewrite the query, can't they?
>>>
>>> select a join b
>>> select b join a
>>>
>>> What I am asking is: should we add an optimizer that uses heuristics
>>> on the tables and automatically streams the smaller/larger?
>>>
>>
>> The reason I am mentioning this is I am training hive users right now.
>> You can imagine that the first three-table join someone did caused an
>> OOM. I explained to them roughly how a hive join works and how you
>> should move the largest table to one side. They understood but
>> replied, "Sounds like something an optimizer could handle."
>>
>> Even when joining two tables, it is a pain to ask someone to find which
>> table is larger. Imagine joining 10 or so. Also consider user perception:
>> imagine your first join throwing an OOM.
>>
>>
>
>
>


Understood, but do you believe sampling tables, as I suggested above,
could work as an effective cost optimizer? The other option, computing
stats, is a problem because Hive does not always produce the data it queries.
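
To make the sampling idea concrete, here is a rough sketch in plain Python
(not Hive code). It samples each table, estimates the largest number of rows
sharing one join key, and puts the table with the highest estimate rightmost.
The function names, the in-memory table representation, and the simple
scale-up by the sampling fraction are all assumptions made for illustration.

```python
import random
from collections import Counter


def estimate_max_rows_per_key(rows, join_key, sample_fraction=0.1, seed=0):
    """Estimate the largest number of rows sharing one join key.

    rows is a list of dicts standing in for a table (hypothetical layout).
    The estimate is the largest per-key count seen in a random sample,
    scaled up by 1 / sample_fraction. This is a crude heuristic.
    """
    rng = random.Random(seed)
    sample = [row for row in rows if rng.random() < sample_fraction]
    if not sample:
        return 0.0
    counts = Counter(row[join_key] for row in sample)
    return max(counts.values()) / sample_fraction


def order_for_join(tables, join_key, sample_fraction=0.1, seed=0):
    """Return table names ordered so the heaviest table comes last.

    tables maps a table name to its list of rows. The last name in the
    result is the one a planner would place rightmost (streamed).
    """
    estimates = {
        name: estimate_max_rows_per_key(rows, join_key, sample_fraction, seed)
        for name, rows in tables.items()
    }
    return sorted(tables, key=lambda name: estimates[name])


if __name__ == "__main__":
    tables = {
        "big": [{"id": 1} for _ in range(1000)],   # many rows per key
        "small": [{"id": i} for i in range(10)],   # one row per key
    }
    print(order_for_join(tables, "id", sample_fraction=1.0))
```

A real version would have to sample from the files behind each table rather
than from rows in memory, and the scale-up is unreliable when the sample is
small or the keys are skewed.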
