[GitHub] [arrow-datafusion] liukun4515 commented on pull request #5087: Make the required distribution of NestedLoopJoinExec consistent with CrossJoinExec

via GitHub Fri, 27 Jan 2023 23:08:47 -0800


liukun4515 commented on PR #5087:
URL: 
https://github.com/apache/arrow-datafusion/pull/5087#issuecomment-1407315189


   > Thanks @liukun4515, your question make me think more.
   > 
   > > When I first implement the NLJ, I didn't consider the JoinSelection for 
the NLJ, I think it is an optimization for NLJ.
   > 
   > Yes, it is an optimization. The `JoinSelection` will choose smaller 
side(`total_byte_size`) as left side for `CrossJoinExec` (NLJ has not been 
supported).
   > 
   
   😭, in other database, the `cross join` is one of type of join type, but we 
split them into two set `cross join type` and `other join type` in the 
datafusion.
   
   The `JoinSelection` for the `cross join` can bring benefit for the `nested 
loop join`, but I think it will bring constraint for the distribution of the 
left side and right side.
   
   
   The `cross join` type just contain one type of join, but the NLJ contains 
many other join types, maybe different join type can get benefit from the 
different required distribution.
   
   we can ignore the changes of distribution for the left and right and the 
physical optimization of `JoinSelection`, because without the optimization, the 
physical exec/plan should be able to executed.
   
   we can discuss this 
https://github.com/apache/arrow-datafusion/issues/5022#issuecomment-1407306585 
issue, why the `RepartitionExec` can't be called twice for the same partition. 
If this issue can be resolved, the physical plan can be executed successfully.
   
   > > why the distribution should be consistent between NLJ and Cross-join?
   > 
   > For `JoinSelection`, after I think more, I think the benefit of the 
consistent is we can reuse the optimize logic of `CrossJoin`. Even they are not 
consistent, we can still add similar optimization for `NestedLoopJoin` in 
`JoinSelection`.
   > 
   > So the main improvement of this pr is `NestedLoopJoinExec` will build one 
side data, not two sides for `Full` join. This can improve performance.
   > 
   > > I think it will bring other performance issue for other join type when 
we always collect the left the NLJ.
   > 
   > Since `JoinSelection` will swap the orders, so there is no difference 
between build-left and build-right.
   
   I think we can change the `required_input_distribution` to optimize the 
performance for the `Full` or other join type, if change the implementation 
like your code in this pr and bring a global `visited bit map` of `state: 
Arc<Mutex<NestedLoopJoinState>>`. But the global lock of state will bring some 
negative effects.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] liukun4515 commented on pull request #5087: Make the required distribution of NestedLoopJoinExec consistent with CrossJoinExec

Reply via email to