[GitHub] [arrow-datafusion] ygf11 commented on pull request #5087: Make the required distribution of NestedLoopJoinExec consistent with CrossJoinExec

via GitHub Fri, 27 Jan 2023 21:13:40 -0800


ygf11 commented on PR #5087:
URL: https://github.com/apache/arrow-datafusion/pull/5087#issuecomment-1407296023


   > When I first implement the NLJ, I didn't consider the JoinSelection for 
the NLJ, I think it is an optimization for NLJ.
   
   Yes, it is an optimization. `JoinSelection` chooses the smaller side 
(by `total_byte_size`) as the left side (NLJ is not supported there yet).
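   
   A minimal sketch of that size-based choice, for illustration only. 
`SideStats` and `should_swap` are hypothetical names and not the actual 
DataFusion API; the sketch assumes a swap happens only when both sizes are 
known.

```rust
/// Hypothetical stand-in for an input's statistics (not DataFusion's type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SideStats {
    /// Estimated size of the input in bytes, if it is known.
    pub total_byte_size: Option<usize>,
}

/// Returns true when the two inputs should be swapped so that the
/// smaller one ends up on the left side.
/// Assumption: without both estimates, the original order is kept.
pub fn should_swap(left: SideStats, right: SideStats) -> bool {
    match (left.total_byte_size, right.total_byte_size) {
        (Some(l), Some(r)) => l > r,
        _ => false,
    }
}
```

   With a 100-byte left input and a 10-byte right input, `should_swap` 
returns `true`; if either size is unknown it returns `false`.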
   
   > why the distribution should be consistent between NLJ and Cross-join?
   
   For `JoinSelection`, after thinking about it more, I think the benefit of 
making them consistent is that we can reuse the optimization logic for 
`CrossJoin`. Even if they are not consistent, we can still add a similar 
optimization for `NestedLoopJoin` in `JoinSelection`.
   
   So the main improvement of this PR is that `NestedLoopJoinExec` will build 
only one side's data, rather than both sides, for a `Full` join. This can 
improve performance.
   
   > I think it will bring other performance issue for other join type when we 
always collect the left the NLJ.
   
   Since `JoinSelection` will swap the order of the inputs, there is no 
difference between build-left and build-right.
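   
   To illustrate why the build side does not matter once the inputs can be 
swapped, here is a rough sketch of how the join type is mirrored by a swap. 
The `JoinType` enum and `swap_join_type` function below are simplified, 
hypothetical definitions written for this comment, not the DataFusion ones.

```rust
/// Simplified, hypothetical join-type enum (only four variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
}

/// Returns the join type that gives the same rows after the two inputs
/// have been exchanged. `Inner` and `Full` are symmetric, while `Left`
/// and `Right` mirror each other.
pub fn swap_join_type(join_type: JoinType) -> JoinType {
    match join_type {
        JoinType::Inner => JoinType::Inner,
        JoinType::Left => JoinType::Right,
        JoinType::Right => JoinType::Left,
        JoinType::Full => JoinType::Full,
    }
}
```

   Swapping twice returns the original join type. A real optimizer would 
also have to restore the original output column order after the swap, which 
this sketch leaves out.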
   


