[
https://issues.apache.org/jira/browse/ARROW-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raúl Cumplido updated ARROW-14163:
----------------------------------
Fix Version/s: 9.0.0
(was: 8.0.0)
> [C++] Naive spillover implementation for join
> ---------------------------------------------
>
> Key: ARROW-14163
> URL: https://issues.apache.org/jira/browse/ARROW-14163
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Labels: query-engine
> Fix For: 9.0.0
>
>
> A join is a pipeline breaker. I believe the proposed join operators assume
> that the data can fit into memory and queue all incoming batches. For
> example, if I understand correctly,
> https://github.com/apache/arrow/pull/11150 queues the right side until the
> left side had finished.
> There are many clever and interesting ways that this can be optimized
> (divide & conquer, recursive query, prioritize reading the left side and
> pause the right side read). This issue is intentionally not clever or
> interesting.
> Instead, I think it would be good to take advantage of this opportunity to
> start fleshing out our spillover capabilities. A very simplistic
> implementation could be a standalone node which has 2 inputs and 2 outputs.
> The node queues up all incoming data on the "right" input and lets the "left"
> input pass through. Then, when the left input has finished the node will
> release the right input.
> This node could then implement a basic spillover mechanism (e.g. IPC to disk)
> and start to flesh out the abstractions that we will eventually want to
> handle different spillover strategies (abort on spill, spill to disk, and
> spill to s3 are all I can think of at the moment).
--
This message was sent by Atlassian Jira
(v8.20.7#820007)