[
https://issues.apache.org/jira/browse/ARROW-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423000#comment-17423000
]
Weston Pace commented on ARROW-14186:
-------------------------------------
[~Jayjeet][~heyjc][~lidavidm][~cpcloud]
> [C++][Dataset] Define appropriate abstractions for "fragments" that can
> handle compute
> --------------------------------------------------------------------------------------
>
> Key: ARROW-14186
> URL: https://issues.apache.org/jira/browse/ARROW-14186
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> This issue has come up in flight (ARROW-10524) and Skyhook (ARROW-13607). In
> both cases there is a desire to scan data from remote data sources. In both
> cases the remote data sources can be capable of essentially running their own
> query engine. I went ahead and created a JIRA to capture some of the
> discussion.
> So maybe this is a question of "how does the datasets API handle distributed
> query?" which is maybe a subquestion of "what is the future of the datasets
> API given richer query frontends?"
> If we treat datasets API as a simple query engine frontend limited to
> scan->filter->project->collect|head|count graphs then filtering can be pushed
> down (and returned with a guarantee) and projection probably can't be pushed
> down if there are multiple data sources. Head can be pushed down but not
> count without some effort.
> If we're thinking of the datasets API as a scan node for a more general query
> engine then I think things get complex rather quickly. I'm not sure if the
> above rules apply or not. For example, a join might combine data from two
> different source. A filter that compares columns on both sides of the join
> could not be pushed down. I'm sure these problems are figured out by more
> general purpose distributed query engines (which presumably slice the query
> plan into smaller query plans for each individual node).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)