Hi Haopu,

How about a full outer join? A single hash table may not be efficient in that case.
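To make the trade-off concrete, here is a rough per-partition sketch in Scala (the names and row type are made up for illustration; this is not Spark's actual HashOuterJoin code). With a single hash table built on the right side, a plain left outer join only needs to stream the left side; a full outer join additionally has to remember which right-side keys were matched, so the unmatched right rows can be emitted at the end:

    import scala.collection.mutable

    object OuterJoinSketch {
      // Hypothetical row shape: (join key, payload). Not Spark's internal Row.
      type Row = (Int, String)

      // Full outer join of one partition using ONE hash table (on the right
      // side) plus a set of matched keys. For a plain left outer join the
      // matched-key set is unnecessary -- the hash table alone is enough.
      def fullOuterJoin(left: Iterator[Row], right: Seq[Row])
          : Iterator[(Option[String], Option[String])] = {

        // Hash table on the right side: key -> all payloads with that key.
        val rightTable = mutable.HashMap.empty[Int, mutable.Buffer[String]]
        for ((k, v) <- right)
          rightTable.getOrElseUpdate(k, mutable.Buffer.empty) += v

        // Right-side keys that found at least one match on the left.
        val matched = mutable.HashSet.empty[Int]

        // Stream the left side, probing the hash table; left rows without a
        // match are padded with None (the null side of the outer join).
        val fromLeft = left.flatMap { case (k, lv) =>
          rightTable.get(k) match {
            case Some(rvs) =>
              matched += k
              rvs.iterator.map(rv => (Some(lv), Some(rv)))
            case None =>
              Iterator((Some(lv), None))
          }
        }

        // Evaluated only after the left side is exhausted (Iterator.++ takes
        // its argument by name), so `matched` is complete at that point.
        def fromRightOnly = rightTable.iterator
          .filterNot { case (k, _) => matched(k) }
          .flatMap { case (_, rvs) => rvs.iterator.map(rv => (None, Some(rv))) }

        fromLeft ++ fromRightOnly
      }

      def main(args: Array[String]): Unit = {
        val left  = Iterator(1 -> "l1", 2 -> "l2")
        val right = Seq(2 -> "r2", 3 -> "r3")
        // Prints (Some(l1),None), (Some(l2),Some(r2)) and then (None,Some(r3)).
        fullOuterJoin(left, right).foreach(println)
      }
    }

Building hash tables on both sides avoids this matched-key bookkeeping, but, as noted earlier in the thread, at a considerable memory cost when partitions are big.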
Liquan

On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang <hw...@qilinsoft.com> wrote:

> Hi Liquan, thanks for the response.
>
> In your example, I think the hash table should be built on the "right"
> side, so Spark can iterate through the left side and find matches in the
> right side from the hash table efficiently. Please comment and suggest,
> thanks again!
>
> ------------------------------
> From: Liquan Pei [mailto:liquan...@gmail.com]
> Sent: September 30, 2014 12:31
> To: Haopu Wang
> Cc: dev@spark.apache.org; user
> Subject: Re: Spark SQL question: why build hashtable for both sides in
> HashOuterJoin?
>
> Hi Haopu,
>
> My understanding is that the hash tables on both the left and right sides
> are used to include null values in the result in an efficient manner. If a
> hash table is built on only one side, say the left side, and we perform a
> left outer join, then for each row on the left side a scan over the right
> side is needed to make sure there are no matching tuples on the right side
> for that row.
>
> Hope this helps!
>
> Liquan
>
> On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
> I took a look at HashOuterJoin and saw that it builds a hash table for
> both sides.
>
> This consumes quite a lot of memory when the partition is big, and it
> doesn't reduce the iteration over the streamed relation, right?
>
> Thanks!
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst