Hi Haopu,

How about a full outer join? A single hash table may not be efficient in that case.
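To make the trade-off concrete, here is a rough per-partition sketch in Scala (the names and row type are made up for illustration; this is not Spark's actual HashOuterJoin code). With a single hash table built on the right side, a plain left outer join only needs to stream the left side; a full outer join additionally has to remember which right-side keys were matched, so the unmatched right rows can be emitted at the end:

    import scala.collection.mutable

    object OuterJoinSketch {
      // Hypothetical row shape: (join key, payload). Not Spark's internal Row.
      type Row = (Int, String)

      // Full outer join of one partition using ONE hash table (on the right
      // side) plus a set of matched keys. For a plain left outer join the
      // matched-key set is unnecessary -- the hash table alone is enough.
      def fullOuterJoin(left: Iterator[Row], right: Seq[Row])
          : Iterator[(Option[String], Option[String])] = {

        // Hash table on the right side: key -> all payloads with that key.
        val rightTable = mutable.HashMap.empty[Int, mutable.Buffer[String]]
        for ((k, v) <- right)
          rightTable.getOrElseUpdate(k, mutable.Buffer.empty) += v

        // Right-side keys that found at least one match on the left.
        val matched = mutable.HashSet.empty[Int]

        // Stream the left side, probing the hash table; left rows without a
        // match are padded with None (the null side of the outer join).
        val fromLeft = left.flatMap { case (k, lv) =>
          rightTable.get(k) match {
            case Some(rvs) =>
              matched += k
              rvs.iterator.map(rv => (Some(lv), Some(rv)))
            case None =>
              Iterator((Some(lv), None))
          }
        }

        // Evaluated only after the left side is exhausted (Iterator.++ takes
        // its argument by name), so `matched` is complete at that point.
        def fromRightOnly = rightTable.iterator
          .filterNot { case (k, _) => matched(k) }
          .flatMap { case (_, rvs) => rvs.iterator.map(rv => (None, Some(rv))) }

        fromLeft ++ fromRightOnly
      }

      def main(args: Array[String]): Unit = {
        val left  = Iterator(1 -> "l1", 2 -> "l2")
        val right = Seq(2 -> "r2", 3 -> "r3")
        // Prints (Some(l1),None), (Some(l2),Some(r2)) and then (None,Some(r3)).
        fullOuterJoin(left, right).foreach(println)
      }
    }

Building hash tables on both sides avoids this matched-key bookkeeping, but, as noted earlier in the thread, at a considerable memory cost when partitions are big.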
Liquan

On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <hw...@qilinsoft.com> wrote:

> Hi Liquan, thanks for the response.
>
> In your example, I think the hash table should be built on the "right"
> side, so Spark can iterate through the left side and find matches in the
> right side from the hash table efficiently. Please comment and suggest,
> thanks again!
>
> ------------------------------
> From: Liquan Pei [mailto:liquan...@gmail.com]
> Sent: September 30, 2014 12:31
> To: Haopu Wang
> Cc: dev@spark.apache.org; user
> Subject: Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> My understanding is that the hash tables on both the left and right sides
> are used to include null values in the result in an efficient manner. If a
> hash table is built on only one side, say the left side, and we perform a
> left outer join, then for each row on the left side a scan over the right
> side is needed to make sure there are no matching tuples on the right side
> for that row.
>
> Hope this helps!
>
> Liquan
>
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> I took a look at HashOuterJoin and saw that it builds a hash table for
> both sides.
>
> This consumes quite a lot of memory when the partition is big, and it
> doesn't reduce the iteration over the streamed relation, right?
>
> Thanks!
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst