jorgecarleitao opened a new pull request #7830:
URL: https://github.com/apache/arrow/pull/7830
This is PR contains a physical plan to execute an inner join. I have not ran
any benchmark, this is pure implementation plus some tests.
The gist of the implementation for a given partition is:
```python
for left_record in left_records:
hash_left = build_hash_of_keys(left_record)
for right_record in right_records:
hash_right = build_hash_of_keys(right_record)
indexes = inner_join(hash_left, hash_right)
yield concat(left_record, right_record)[indexes]
```
I.e. inefficient.
The implementation is currently sequential, even though it can be trivially
distributed as each RecordBatch is evaluated independently (we still lock the
mutex on partition reading, as in other physical plans). Since we have not
committed to a distributed computational model, IMO the sequential is enough
for now.
This PR is built on top of #7687 and #7796
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]