Hi, Liquan, thanks for the response.
In your example, I think the hash table should be built on the "right" side, so that Spark can iterate over the left side and look up matches in the right-side hash table efficiently. Please comment and suggest, thanks again!

________________________________
From: Liquan Pei [mailto:[email protected]]
Sent: September 30, 2014 12:31
To: Haopu Wang
Cc: [email protected]; user
Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

Hi Haopu,

My understanding is that the hash tables on both the left and right sides are used to include null values in the result efficiently. If a hash table is built on only one side, say the left side, and we perform a left outer join, then for each row on the left side a scan over the right side is needed to make sure that there are no matching tuples for that row.

Hope this helps!
Liquan

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <[email protected]> wrote:

I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big, and it doesn't reduce the iteration over the streamed relation, right?

Thanks!

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
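
[Editor's sketch] To make the streamed-side idea discussed above concrete, here is a minimal plain-Scala sketch of a left outer join that builds a hash table only on the right ("build") side and streams the left side once. This is not Spark's actual HashOuterJoin code; Row, buildHashTable, leftOuterJoin, and the sample data are illustrative names, not Spark APIs.

// Minimal sketch, not Spark code: left outer join with a hash table
// built only on the right side; the left side is streamed once.
object LeftOuterJoinSketch {
  type Row = Map[String, Any] // illustrative row representation

  // Build a multi-map from join key to matching rows on the build (right) side.
  def buildHashTable(rows: Seq[Row], key: String): Map[Any, Seq[Row]] =
    rows.groupBy(_(key))

  // Stream the left side; each left row is probed against the hash table.
  // Unmatched left rows are emitted with a None (null-padded) right side,
  // so no per-row scan of the right relation is needed.
  def leftOuterJoin(left: Seq[Row], right: Seq[Row], key: String): Seq[(Row, Option[Row])] = {
    val hashed = buildHashTable(right, key)
    left.flatMap { l =>
      hashed.get(l(key)) match {
        case Some(matches) => matches.map(r => (l, Some(r)))
        case None          => Seq((l, None)) // null-padded output row
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val left  = Seq(Map("id" -> 1, "a" -> "x"), Map("id" -> 2, "a" -> "y"))
    val right = Seq(Map("id" -> 1, "b" -> "p"), Map("id" -> 3, "b" -> "q"))
    leftOuterJoin(left, right, "id").foreach(println)
    // (Map(id -> 1, a -> x),Some(Map(id -> 1, b -> p)))
    // (Map(id -> 2, a -> y),None)
  }
}

A full outer join would additionally need to emit build-side rows that never matched, which requires tracking matches on the build side (for example a per-row flag), but that bookkeeping alone does not require hashing both relations; whether that is why HashOuterJoin builds both tables is exactly the question raised in this thread.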
