Weston Pace created ARROW-14186:
-----------------------------------

             Summary: [C++][Dataset] Define appropriate abstractions for 
"fragments" that can handle compute
                 Key: ARROW-14186
                 URL: https://issues.apache.org/jira/browse/ARROW-14186
             Project: Apache Arrow
          Issue Type: Wish
          Components: C++
            Reporter: Weston Pace


This issue has come up in flight (ARROW-10524) and Skyhook (ARROW-13607).  In 
both cases there is a desire to scan data from remote data sources.  In both 
cases the remote data sources can be capable of essentially running their own 
query engine.  I went ahead and created a JIRA to capture some of the 
discussion.

So maybe this is a question of "how does the datasets API handle distributed 
query?" which is maybe a subquestion of "what is the future of the datasets API 
given richer query frontends?"

If we treat datasets API as a simple query engine frontend limited to 
scan->filter->project->collect|head|count graphs then filtering can be pushed 
down (and returned with a guarantee) and projection probably can't be pushed 
down if there are multiple data sources.  Head can be pushed down but not count 
without some effort.

If we're thinking of the datasets API as a scan node for a more general query 
engine then I think things get complex rather quickly.  I'm not sure if the 
above rules apply or not.  For example, a join might combine data from two 
different source.  A filter that compares columns on both sides of the join 
could not be pushed down.  I'm sure these problems are figured out by more 
general purpose distributed query engines (which presumably slice the query 
plan into smaller query plans for each individual node).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to