[GitHub] [arrow-datafusion] liukun4515 commented on issue #4356: refactor the code of the `HashJoin`

GitBox Fri, 25 Nov 2022 01:11:06 -0800


liukun4515 commented on issue #4356:
URL: 
https://github.com/apache/arrow-datafusion/issues/4356#issuecomment-1327188814


   > > The current code base has many `match` paths for each `Join_type`; each `join_type` has different logic and its own path, so it is easy to introduce bugs when we add a feature to `HashJoin`.
   > 
   > Yes, I agree.
   > 
   > > split the vectorized `HashJoin` into three phases:
   > > 
   > > 1. get the matches for the equi-join condition: `left_idx` and `right_idx`
   > > 2. apply the non-equi filter to `left_idx` and `right_idx` to get `filter_left_idx` and `filter_right_idx`
   > > 3. construct the result according to the `Join Type`
   > 
   > For HashJoin, there are two big phases: **build** and **probe**:
   > 
   > 1. For the **build** phase, we barely care about **JoinType**.
   > 2. For the **probe** phase, **JoinType** determines the direction. So how about splitting the `match` paths at the beginning of the **probe** phase:
   >    ```rust
   >     match join_type {
   >         inner => probe_inner_join(),
   >         left => probe_left_join(),
   >         ....
   >     }
   >    ```
   >    In each probe method, we can process both the non-equi conditions and the equi conditions. The results of the non-equi conditions depend on **JoinType**.
   
   The probe phase has many common stages. 
   In the vectorized hash join, the first stage is to get the left/right indices that match the `on` join condition.
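A minimal sketch of that first stage, assuming a single `i64` join key per side and plain slices instead of Arrow arrays. The function name `equi_match_indices` is hypothetical and is not DataFusion's API; it only illustrates producing parallel `left_idx`/`right_idx` vectors independently of the join type.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not DataFusion's API): returns parallel vectors of
/// left and right row indices whose keys are equal.
fn equi_match_indices(left_keys: &[i64], right_keys: &[i64]) -> (Vec<usize>, Vec<usize>) {
    // Build phase: map each left key to the left rows that carry it.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (i, k) in left_keys.iter().enumerate() {
        table.entry(*k).or_default().push(i);
    }

    // Probe phase: for each right row, emit one (left, right) pair per match.
    let mut left_idx = Vec::new();
    let mut right_idx = Vec::new();
    for (j, k) in right_keys.iter().enumerate() {
        if let Some(rows) = table.get(k) {
            for &i in rows {
                left_idx.push(i);
                right_idx.push(j);
            }
        }
    }
    (left_idx, right_idx)
}
```

Nothing here depends on the join type, which is why this stage can be shared by every probe path.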
   
   Next, use the left/right indices to generate the result batch according to the join type. Some join types, however, must also maintain a bitmap over the left side in order to generate the final result, for example left/full/leftanti/leftsemi.
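The role of the left-side bitmap can be sketched as follows. This is an illustration under simplifying assumptions, not the actual implementation: `JoinKind`, `Row`, and `build_output` are hypothetical names, the output is a list of optional row indices rather than a record batch, and everything happens in one pass, whereas a real operator would keep the bitmap across all probe batches and emit unmatched left rows only at the end. Full join is left out because it also needs the unmatched right rows.

```rust
/// Hypothetical subset of join types, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinKind {
    Inner,
    Left,
    LeftSemi,
    LeftAnti,
}

/// One output row as (left row, right row); `None` marks a null-padded side.
type Row = (Option<usize>, Option<usize>);

fn build_output(
    kind: JoinKind,
    left_rows: usize,
    left_idx: &[usize],
    right_idx: &[usize],
) -> Vec<Row> {
    // Left-side bitmap: which left rows matched at least once.
    let mut visited = vec![false; left_rows];
    for &i in left_idx {
        visited[i] = true;
    }

    // Matched pairs, shared by inner and left joins.
    let matched = || -> Vec<Row> {
        left_idx
            .iter()
            .zip(right_idx)
            .map(|(&l, &r)| (Some(l), Some(r)))
            .collect()
    };
    // Left rows selected by their bitmap value, with no right side.
    let left_only = |want: bool| -> Vec<Row> {
        (0..left_rows)
            .filter(|&i| visited[i] == want)
            .map(|i| (Some(i), None))
            .collect()
    };

    match kind {
        JoinKind::Inner => matched(),
        JoinKind::Left => {
            let mut out = matched();
            out.extend(left_only(false)); // append unmatched left rows
            out
        }
        JoinKind::LeftSemi => left_only(true),
        JoinKind::LeftAnti => left_only(false),
    }
}
```

Only the final `match` differs per join type; the index inputs and the bitmap update are common to all of them.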


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
