Weston Pace created ARROW-14186:
-----------------------------------
Summary: [C++][Dataset] Define appropriate abstractions for
"fragments" that can handle compute
Key: ARROW-14186
URL: https://issues.apache.org/jira/browse/ARROW-14186
Project: Apache Arrow
Issue Type: Wish
Components: C++
Reporter: Weston Pace
This issue has come up in flight (ARROW-10524) and Skyhook (ARROW-13607). In
both cases there is a desire to scan data from remote data sources. In both
cases the remote data sources can be capable of essentially running their own
query engine. I went ahead and created a JIRA to capture some of the
discussion.
So maybe this is a question of "how does the datasets API handle distributed
query?" which is maybe a subquestion of "what is the future of the datasets API
given richer query frontends?"
If we treat datasets API as a simple query engine frontend limited to
scan->filter->project->collect|head|count graphs then filtering can be pushed
down (and returned with a guarantee) and projection probably can't be pushed
down if there are multiple data sources. Head can be pushed down but not count
without some effort.
If we're thinking of the datasets API as a scan node for a more general query
engine then I think things get complex rather quickly. I'm not sure if the
above rules apply or not. For example, a join might combine data from two
different source. A filter that compares columns on both sides of the join
could not be pushed down. I'm sure these problems are figured out by more
general purpose distributed query engines (which presumably slice the query
plan into smaller query plans for each individual node).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)