ygf11 commented on PR #5087: URL: https://github.com/apache/arrow-datafusion/pull/5087#issuecomment-1407296023
> When I first implement the NLJ, I didn't consider the JoinSelection for the NLJ, I think it is an optimization for NLJ.

Yes, it is an optimization. `JoinSelection` chooses the smaller side (by `total_byte_size`) as the left side, but NLJ is not supported there yet.

> why the distribution should be consistent between NLJ and Cross-join?

For `JoinSelection`, after thinking about it more, I think the benefit of keeping them consistent is that we can reuse the optimization logic of `CrossJoin`. Even if they are not consistent, we can still add a similar optimization for `NestedLoopJoin` in `JoinSelection`.

So the main improvement of this PR is that `NestedLoopJoinExec` builds only one side of the data, not both sides, for a `Full` join. This can improve performance.

> I think it will bring other performance issue for other join type when we always collect the left the NLJ.

Since `JoinSelection` will swap the input order, there is no difference between build-left and build-right.
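
To make the swap argument concrete, here is a minimal sketch of the idea, not DataFusion's actual `JoinSelection` code. The `Stats` and `JoinInputs` types and the `should_swap` and `select_build_side` functions are hypothetical stand-ins invented for illustration; only the `total_byte_size` name comes from the discussion above. The sketch assumes the left side is always the build side and that unknown statistics mean "do not swap".

```rust
/// Hypothetical, simplified stand-in for plan statistics (not DataFusion's real type).
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    /// Estimated size of the input in bytes, if known.
    total_byte_size: Option<usize>,
}

/// Hypothetical pair of join inputs; the left side is assumed to be the build side.
#[derive(Debug, Clone, PartialEq)]
struct JoinInputs {
    left: Stats,
    right: Stats,
}

/// Returns true only when both sizes are known and the left side is larger.
fn should_swap(left: &Stats, right: &Stats) -> bool {
    match (left.total_byte_size, right.total_byte_size) {
        (Some(l), Some(r)) => l > r,
        // With missing statistics, keep the original order.
        _ => false,
    }
}

/// Puts the smaller input on the left so that building the left side is the cheaper choice.
fn select_build_side(inputs: JoinInputs) -> JoinInputs {
    if should_swap(&inputs.left, &inputs.right) {
        JoinInputs {
            left: inputs.right,
            right: inputs.left,
        }
    } else {
        inputs
    }
}
```

With a rule like this in place, always building the left side costs no more than choosing the build side per join, because the smaller input ends up on the left whenever both sizes are known. A real optimizer rule would presumably also need to adjust the join type and the output column order after a swap, which this sketch leaves out.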
