[ 
https://issues.apache.org/jira/browse/ARROW-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181278#comment-17181278
 ] 

Andy Grove commented on ARROW-9423:
-----------------------------------

I can take this one. I would like to implement a hash inner join first, where 
one side of the join is loaded into the memory and the other side is streamed, 
performing one hash lookup per row on the streaming side.

I want to address the refactor of the physical plan and introduce the physical 
optimizer first though, since that will be a bit disruptive.

> [Rust][DataFusion] Add join
> ---------------------------
>
>                 Key: ARROW-9423
>                 URL: https://issues.apache.org/jira/browse/ARROW-9423
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Rust - DataFusion
>            Reporter: Jorge
>            Assignee: Andy Grove
>            Priority: Major
>
> A major operation in analytics is the join. This issue concerns adding the 
> join operation.
> Given the complexity of this task, I propose starting with a sub-set of all 
> joins, an hash join whose "ON" can only be a set of column names (i.e. no 
> expressions).
> Suggestion for DOD:
> * physical plan to execute the join
> * logical plan with the join
> * SQL planner with the join
> * tests on each of the above
> One idea to perform this join in parallel is to, for each RecordBatch in the 
> left, perform the join with a record on the right. Another way is to first 
> perform a hash by key and sort on both sides, and then perform a 
> "SortMergeJoin" on each of the partitions. There may be better ways to 
> achieve this, though.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to