jorgecarleitao opened a new pull request #7830:
URL: https://github.com/apache/arrow/pull/7830


   This is PR contains a physical plan to execute an inner join. I have not ran 
any benchmark, this is pure implementation plus some tests.
   
   The gist of the implementation for a given partition is:
   
   ```python
   for left_record in left_records:
        hash_left = build_hash_of_keys(left_record)
        for right_record in right_records:
               hash_right = build_hash_of_keys(right_record)
               indexes = inner_join(hash_left, hash_right)
               yield concat(left_record, right_record)[indexes]
   ```
   
   I.e. inefficient.
   
   The implementation is currently sequential, even though it can be trivially 
distributed as each RecordBatch is evaluated independently (we still lock the 
mutex on partition reading, as in other physical plans). Since we have not 
committed to a distributed computational model, IMO the sequential is enough 
for now.
   
   This PR is built on top of #7687 and #7796 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to