[ 
https://issues.apache.org/jira/browse/ARROW-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423000#comment-17423000
 ] 

Weston Pace commented on ARROW-14186:
-------------------------------------

[~Jayjeet][~heyjc][~lidavidm][~cpcloud]

> [C++][Dataset] Define appropriate abstractions for "fragments" that can 
> handle compute
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-14186
>                 URL: https://issues.apache.org/jira/browse/ARROW-14186
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> This issue has come up in Flight (ARROW-10524) and Skyhook (ARROW-13607).  In 
> both cases there is a desire to scan data from remote data sources, and in 
> both cases the remote data sources may be capable of essentially running 
> their own query engine.  I went ahead and created a JIRA to capture some of 
> the discussion.
> So maybe this is a question of "how does the datasets API handle distributed 
> query?" which is maybe a subquestion of "what is the future of the datasets 
> API given richer query frontends?"
> If we treat the datasets API as a simple query engine frontend limited to 
> scan->filter->project->collect|head|count graphs, then filtering can be 
> pushed down (and returned with a guarantee) and projection probably can't be 
> pushed down if there are multiple data sources.  Head can be pushed down, but 
> count cannot without some effort.
> If we're thinking of the datasets API as a scan node for a more general query 
> engine, then I think things get complex rather quickly.  I'm not sure if the 
> above rules apply or not.  For example, a join might combine data from two 
> different sources.  A filter that compares columns on both sides of the join 
> could not be pushed down.  I'm sure these problems have been solved by more 
> general purpose distributed query engines (which presumably slice the query 
> plan into smaller query plans for each individual node).
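
The join example above can be sketched in a few lines. This is only an illustration of the idea, not Arrow's datasets API: `Predicate` and `split_join_filter` are hypothetical names, and the filter is assumed to already be broken into AND-ed conjuncts that each declare the columns they read. A conjunct that touches only one side of the join can be sent to that side's source, while one that compares columns from both sides has to stay above the join.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class Predicate:
    """One conjunct of a filter plus the columns it reads (hypothetical type)."""
    name: str
    columns: FrozenSet[str]
    test: Callable[[Row], bool]


def split_join_filter(
    conjuncts: List[Predicate],
    left_columns: FrozenSet[str],
    right_columns: FrozenSet[str],
) -> Tuple[List[Predicate], List[Predicate], List[Predicate]]:
    """Split AND-ed conjuncts into (push to left, push to right, keep above join)."""
    left: List[Predicate] = []
    right: List[Predicate] = []
    residual: List[Predicate] = []
    for pred in conjuncts:
        if pred.columns <= left_columns:
            # Reads only left-side columns: the left source can evaluate it.
            left.append(pred)
        elif pred.columns <= right_columns:
            # Reads only right-side columns: the right source can evaluate it.
            right.append(pred)
        else:
            # Compares columns from both sides: must run after the join.
            residual.append(pred)
    return left, right, residual


# Made-up column names for two remote sources joined together.
left_cols = frozenset({"order_id", "amount"})
right_cols = frozenset({"customer_id", "credit_limit"})

conjuncts = [
    Predicate("amount > 100", frozenset({"amount"}),
              lambda r: r["amount"] > 100),
    Predicate("credit_limit > 0", frozenset({"credit_limit"}),
              lambda r: r["credit_limit"] > 0),
    Predicate("amount < credit_limit", frozenset({"amount", "credit_limit"}),
              lambda r: r["amount"] < r["credit_limit"]),
]

left_push, right_push, residual = split_join_filter(conjuncts, left_cols, right_cols)
print("left:", [p.name for p in left_push])
print("right:", [p.name for p in right_push])
print("above join:", [p.name for p in residual])
```

A real planner would also need each source to report which of the pushed predicates it actually guarantees, so that anything it could not apply is re-evaluated locally.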



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
