c21 commented on pull request #29130:
URL: https://github.com/apache/spark/pull/29130#issuecomment-660137442


   @cloud-fan and @imback82 - I was not aware of 
https://github.com/apache/spark/pull/28676 before making this PR. After 
checking https://github.com/apache/spark/pull/28676, TLDR is I think we are 
solving different issues and there's no code conflict between these two.
   
   https://github.com/apache/spark/pull/28676: preserve **broadcast hash join** 
build side partitioning for inner join if the stream side is hash partitioned. 
@imback82 It's a great idea that I never thought about it before. I bet in 
production, out users should have hit this issue before, but I think our action 
was just asking them to disable broadcast join (SMJ on small table, instead of 
broadcasting it - the cost of it is small, as the table should be small enough 
to be broadcasted), then partitioning info gets propagated through query plan, 
and the followed shuffle can be saved. But I think 
https://github.com/apache/spark/pull/28676 handles the thing automatically on 
spark side, which should be better.
   
   this PR: preserve **shuffled hash join** build side partitioning, which is a 
much smaller trivial change compared to handle broadcast hash join. Because for 
required children distribution, shuffled hash join is same as sort merge join, 
shuffled hash join output partitioning should be same as sort merge join 
(except it cannot handle full outer join case). We found this issue when our 
users did shuffled hash join on bucketed tables, and had followed join / 
group-by on build side. We have run this in production for more than one years. 
There's no config needed for this feature, and IMO it could be enabled by 
default. 
   
   What do you think? Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to