Jorge created ARROW-9423:
----------------------------
Summary: Add join
Key: ARROW-9423
URL: https://issues.apache.org/jira/browse/ARROW-9423
Project: Apache Arrow
Issue Type: Task
Components: Rust - DataFusion
Reporter: Jorge
A major operation in analytics is the join. This issue concerns adding the join
operation.
Given the complexity of this task, I propose starting with a subset of all
joins: an inner join whose "ON" clause can only be a set of column names (i.e.
no expressions).
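To make the proposed scope concrete, here is a minimal sketch of an inner join
whose "ON" is just a list of column names. It is an illustration only, not
DataFusion code: the `Row` type, `inner_join`, and `key_of` are hypothetical
names, and a row is modelled as a map from column name to an `i64` value
instead of Arrow arrays.

```rust
use std::collections::HashMap;

/// One row as column name -> value. A stand-in for Arrow data (assumption).
type Row = HashMap<String, i64>;

/// Extract the join key of `row` for the columns in `on`.
/// Returns None if any of the key columns is missing from the row.
fn key_of(row: &Row, on: &[&str]) -> Option<Vec<i64>> {
    on.iter().map(|c| row.get(*c).copied()).collect()
}

/// Inner join of `left` and `right` on equality of the columns named in `on`.
/// Rows that lack one of the key columns never match.
fn inner_join(left: &[Row], right: &[Row], on: &[&str]) -> Vec<Row> {
    // Build phase: index the right side by its key values.
    let mut index: HashMap<Vec<i64>, Vec<&Row>> = HashMap::new();
    for row in right {
        if let Some(key) = key_of(row, on) {
            index.entry(key).or_default().push(row);
        }
    }

    // Probe phase: look up each left row and emit one output row per match.
    let mut out = Vec::new();
    for l in left {
        let Some(key) = key_of(l, on) else { continue };
        let Some(matches) = index.get(&key) else { continue };
        for r in matches {
            let mut joined = l.clone();
            // Copy the right-hand columns; on a name clash the left value wins.
            for (name, value) in r.iter() {
                joined.entry(name.clone()).or_insert(*value);
            }
            out.push(joined);
        }
    }
    out
}
```

Restricting "ON" to column names means the key is a plain tuple of values per
row, so no expression evaluation is needed to decide whether two rows match.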
Suggested definition of done (DoD):
* physical plan to execute the join
* logical plan with the join
* SQL planner with the join
* tests on each of the above
One way to perform this join in parallel is to join each RecordBatch on the
left against the records on the right. Another is to first partition both
sides by a hash of the key, sort each partition, and then perform a
"SortMergeJoin" on each pair of matching partitions. There may be better ways
to achieve this, though.
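The second idea can be sketched roughly as follows. This is only an
illustration under simplifying assumptions, not a proposed implementation:
`KeyedRow`, `hash_partition`, `sort_merge_join`, and `partitioned_join` are
hypothetical names, a row is a single `i64` key plus a string payload, and
the partitions are processed sequentially, although each pair is independent
and could be handed to its own thread.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// (key, payload): a stand-in for one row of a RecordBatch (assumption).
type KeyedRow = (i64, &'static str);
/// (key, left payload, right payload)
type Joined = (i64, &'static str, &'static str);

/// Split `rows` into `n` partitions by hash of the key (`n` must be > 0).
/// Equal keys always end up in the partition with the same index.
fn hash_partition(rows: &[KeyedRow], n: usize) -> Vec<Vec<KeyedRow>> {
    let mut parts = vec![Vec::new(); n];
    for &row in rows {
        let mut hasher = DefaultHasher::new();
        row.0.hash(&mut hasher);
        parts[(hasher.finish() % n as u64) as usize].push(row);
    }
    parts
}

/// Inner sort-merge join of a single pair of partitions.
fn sort_merge_join(mut left: Vec<KeyedRow>, mut right: Vec<KeyedRow>) -> Vec<Joined> {
    left.sort_by_key(|r| r.0);
    right.sort_by_key(|r| r.0);
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        if left[i].0 < right[j].0 {
            i += 1;
        } else if left[i].0 > right[j].0 {
            j += 1;
        } else {
            // Equal keys: find the run of this key on both sides and
            // emit every left/right combination within the runs.
            let key = left[i].0;
            let i_end = i + left[i..].iter().take_while(|r| r.0 == key).count();
            let j_end = j + right[j..].iter().take_while(|r| r.0 == key).count();
            for l in &left[i..i_end] {
                for r in &right[j..j_end] {
                    out.push((key, l.1, r.1));
                }
            }
            i = i_end;
            j = j_end;
        }
    }
    out
}

/// Partition both sides the same way, then join matching partitions.
fn partitioned_join(left: &[KeyedRow], right: &[KeyedRow], n: usize) -> Vec<Joined> {
    hash_partition(left, n)
        .into_iter()
        .zip(hash_partition(right, n))
        .flat_map(|(l, r)| sort_merge_join(l, r))
        .collect()
}
```

Because both sides are partitioned with the same hash function, rows with
equal keys can only meet inside the same partition index, which is what makes
the per-partition joins independent of each other.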
--
This message was sent by Atlassian Jira
(v8.3.4#803005)