[jira] [Commented] (ARROW-9555) [Rust] [DataFusion] Add inner (hash) equijoin physical plan

Jira Mon, 16 Nov 2020 22:20:47 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233286#comment-17233286
 ]


Jorge Leitão commented on ARROW-9555:
-------------------------------------

I have a draft almost ready.

The implementation is blocked by #8689. Specifically, I need an API in arrow to 
create an array from multiple arrays (each array in the new recordBatch will be 
composed by multiple entries from different arrays).

This is related to the work I have been doing on generalizing the filtering 
operation (as we also need to build an array a single array), `take` (same 
issue), sort-merge (same, but create an array from two arrays). The join is the 
extreme case where we need to create an array from N+1 arrays (N batches from 
the left + 1 from the right).

I will be working on top of the code in #8689 as that is paramount for me to 
accomplish this.

> [Rust] [DataFusion] Add inner (hash) equijoin physical plan
> -----------------------------------------------------------
>
>                 Key: ARROW-9555
>                 URL: https://issues.apache.org/jira/browse/ARROW-9555
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust - DataFusion
>            Reporter: Jorge Leitão
>            Assignee: Jorge Leitão
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Here is an overview of how I think we should implement support for equijoins, 
> at least for the initial implementation.
>  * Read all batches from the left-side of the join into a single 
> Vec<RecordBatch>
>  * Create a map something like HashMap<Vec<ScalarValue>, Vec<(usize,usize)>> 
> to map keys to batch/row indices
>  * Iterate over this Vec<RecordBatch> and create an entry in a hash map, 
> mapping the join keys to the index of the batch and row in the 
> Vec<RecordBatch>
>  * For each input partition on the right-side of the join, return an output 
> partition that is an iterator/stream that:
>  ** For each input row, evaluate the join keys
>  ** Look up those join keys in the hash map
>  ** If a match is found:
>  *** For each (batch, row) index create an output row which has the values 
> from both the left and right row and emit it
>  ** If no match is found:
>  *** Do not emit a row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-9555) [Rust] [DataFusion] Add inner (hash) equijoin physical plan

Reply via email to