[jira] [Commented] (ARROW-9423) [Rust][DataFusion] Add join

Andy Grove (Jira) Sat, 26 Sep 2020 09:56:33 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202629#comment-17202629
 ]


Andy Grove commented on ARROW-9423:
-----------------------------------

I am still planning on implementing this, but would like to get the scheduler / 
threading model resolved first.

> [Rust][DataFusion] Add join
> ---------------------------
>
>                 Key: ARROW-9423
>                 URL: https://issues.apache.org/jira/browse/ARROW-9423
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Rust - DataFusion
>            Reporter: Jorge
>            Assignee: Andy Grove
>            Priority: Major
>
> A major operation in analytics is the join. This issue concerns adding the 
> join operation.
> Given the complexity of this task, I propose starting with a subset of all 
> joins: a hash join whose "ON" can only be a set of column names (i.e. no 
> expressions).
> Suggested definition of done (DOD):
> * physical plan to execute the join
> * logical plan with the join
> * SQL planner with the join
> * tests on each of the above
> One idea for performing this join in parallel is, for each RecordBatch on 
> the left, to perform the join with a record on the right. Another is to 
> first hash by key and sort both sides, and then perform a "SortMergeJoin" 
> on each of the partitions. There may be better ways to achieve this, 
> though.
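
For illustration only, the restricted hash join proposed in the description
(an equi-join whose "ON" is a set of column names) could be sketched roughly
as below. This is not DataFusion code: the `Row` type, `key_of`, and
`hash_join` are hypothetical names, rows are plain vectors standing in for
Arrow RecordBatches, and only inner-join semantics with integer values are
assumed.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for Arrow data: a row is a list of
/// (column name, value) pairs. Real code would operate on RecordBatches.
type Row = Vec<(String, i64)>;

/// Collect the values of the join-key columns named in `on`.
/// Returns None if the row lacks any of those columns.
fn key_of(row: &Row, on: &[&str]) -> Option<Vec<i64>> {
    on.iter()
        .map(|name| {
            row.iter()
                .find(|(c, _)| c.as_str() == *name)
                .map(|(_, v)| *v)
        })
        .collect()
}

/// Inner hash join on column names only (no expressions).
fn hash_join(left: &[Row], right: &[Row], on: &[&str]) -> Vec<Row> {
    // Build phase: index every left row by its join key.
    let mut table: HashMap<Vec<i64>, Vec<&Row>> = HashMap::new();
    for row in left {
        if let Some(key) = key_of(row, on) {
            table.entry(key).or_default().push(row);
        }
    }

    // Probe phase: look up each right row and emit one output row per match.
    let mut out = Vec::new();
    for r in right {
        if let Some(key) = key_of(r, on) {
            if let Some(matches) = table.get(&key) {
                for l in matches {
                    let mut joined: Row = (*l).clone();
                    // Append the right-side columns, skipping the key
                    // columns already present from the left side.
                    joined.extend(
                        r.iter()
                            .filter(|(c, _)| !on.iter().any(|k| *k == c.as_str()))
                            .cloned(),
                    );
                    out.push(joined);
                }
            }
        }
    }
    out
}
```

Calling `hash_join(&left, &right, &["id"])` returns one combined row for each
pair of left and right rows that share the same "id" value. Building the
table on one side and probing with the other is what would allow each probe
batch to be processed independently, which is one way to read the
parallelism idea in the description.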



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
