RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Haopu Wang
Liquan, yes, for a full outer join, building hash tables on both sides is more efficient.

For a left/right outer join, it looks like one hash table should be enough.
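To make the left/right case concrete, here is a minimal sketch (plain Python, not Spark's actual implementation) of a left outer join that builds a hash table on the right side only and streams the left side:

```python
from collections import defaultdict

def hash_left_outer_join(left, right, key):
    # Build phase: hash only the right (build) side.
    table = defaultdict(list)
    for r in right:
        table[key(r)].append(r)
    # Probe phase: stream the left side once; no second hash table needed.
    result = []
    for l in left:
        matches = table.get(key(l))
        if matches:
            result.extend((l, r) for r in matches)
        else:
            result.append((l, None))  # unmatched left row, padded with null
    return result

left = [(1, 'a'), (2, 'b'), (3, 'c')]
right = [(1, 'x'), (1, 'y'), (4, 'z')]
rows = hash_left_outer_join(left, right, key=lambda row: row[0])
```

The streamed (left) side is traversed exactly once and never hashed, which is the memory saving being discussed.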

 




Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins in Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins build both, and it seems like we could optimize the ones that are not full.

Matei
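As a rough illustration of the one-sided build Matei describes (a toy sketch, not the actual HashJoin.joinIterators code), an inner hash join only needs to hash one relation, typically the smaller one:

```python
def hash_inner_join(a, b, key):
    # Hash the smaller relation; stream the larger one.
    build, stream, swapped = (a, b, False) if len(a) <= len(b) else (b, a, True)
    table = {}
    for row in build:
        table.setdefault(key(row), []).append(row)
    # Probe: each streamed row is checked against the hash table only.
    for s in stream:
        for m in table.get(key(s), []):
            # Preserve (a_row, b_row) order regardless of which side was built.
            yield (m, s) if not swapped else (s, m)

pairs = list(hash_inner_join([(1, 'a'), (2, 'b')],
                             [(1, 'x'), (2, 'y'), (3, 'z')],
                             lambda row: row[0]))
```

Since an inner join emits only matching pairs, no bookkeeping for unmatched rows is needed, so one hash table suffices.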



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
I am working on a PR that leverages the HashJoin trait code to optimize the left/right outer join. It has already been tested locally; I will send out the PR soon after some cleanup.

Thanks,
Liquan






-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Haopu Wang
Hi, Liquan, thanks for the response.

 

In your example, I think the hash table should be built on the right side, so that Spark can iterate through the left side and look up matches in the hash table efficiently. Please comment and suggest; thanks again!

 






Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Liquan Pei
Hi Haopu,

How about a full outer join? One hash table may not be efficient for that case.

Liquan
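For reference, here is a toy sketch (plain Python, not the actual HashOuterJoin code) of a full outer join that hashes both sides, so that unmatched rows from either side can be emitted with null padding:

```python
from collections import defaultdict

def hash_full_outer_join(left, right, key):
    # Hash BOTH sides, mirroring what the thread describes for
    # HashOuterJoin: iterating over the union of keys lets unmatched
    # rows from either side be emitted with null padding.
    lt, rt = defaultdict(list), defaultdict(list)
    for l in left:
        lt[key(l)].append(l)
    for r in right:
        rt[key(r)].append(r)
    result = []
    for k in set(lt) | set(rt):
        ls = lt.get(k, [None])  # [None] stands in for "no left match"
        rs = rt.get(k, [None])  # [None] stands in for "no right match"
        result.extend((l, r) for l in ls for r in rs)
    return result

rows = hash_full_outer_join([(1, 'a'), (2, 'b')],
                            [(2, 'x'), (3, 'y')],
                            lambda row: row[0])
```

With only one side hashed, unmatched rows of the unhashed side would still have to be discovered somehow (e.g. by tracking matched entries), which is the inefficiency Liquan is pointing at.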



-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin, and it builds a hash table for both
sides.

This consumes quite a lot of memory when a partition is big, and it
doesn't reduce the iteration over the streamed relation, right?

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
Hi Haopu,

My understanding is that the hash tables on both the left and right sides are used to include null values in the result efficiently. If a hash table is built on only one side, say the left side, then when we perform a left outer join, a scan over the right side is needed for each row on the left side to make sure there are no matching tuples for that row.

Hope this helps!
Liquan
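To illustrate the cost being avoided (a toy sketch, not Spark code): without a hash table, confirming that a row has no match means scanning the other side for every row, i.e. O(|left| * |right|) comparisons instead of O(|left| + |right|):

```python
def naive_left_outer_join(left, right, key):
    # No hash table anywhere: for each left row we must scan the
    # ENTIRE right side before we can conclude there is no match
    # and emit the null-padded row.
    result = []
    for l in left:
        matched = False
        for r in right:          # full scan per left row
            if key(l) == key(r):
                result.append((l, r))
                matched = True
        if not matched:
            result.append((l, None))
    return result

rows = naive_left_outer_join([(1, 'a'), (2, 'b')],
                             [(1, 'x')],
                             lambda row: row[0])
```

The hash-based variants discussed in the thread replace the inner scan with a constant-time (expected) table lookup.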




-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst