[jira] [Commented] (ARROW-15271) [R] Refactor do_exec_plan to return a RecordBatchReader

Neal Richardson (Jira) Tue, 19 Apr 2022 10:37:08 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524462#comment-17524462
 ]


Neal Richardson commented on ARROW-15271:
-----------------------------------------

The underlying issue is that there are some operations that can't be fully 
supported with the current ExecPlan:

* sorting on a temporary expression (e.g. {{arrange(ds, x * y)}}): you have to 
project the expression into a derived column, collect the sorted data, and then 
drop the derived column. But if you drop the column using {{select()}} and run 
the result through the ExecPlan again, you lose the sorting, because sorting 
currently only happens in the last step (SinkNode)
* {{arrange %>% tail}}: {{head}}/{{tail}} after {{arrange}} is implemented as 
a topK operation, so for {{head}} you get data in the right order. For 
{{tail}}, it's done by reversing the sort and taking topK (i.e. there's no 
bottomK that returns rows in that order), so you have to re-sort the result. 
I guess this could be done with another ExecPlan that re-sorts, which could 
yield a RBR, though awkwardly.
* ARROW-14289: head.RecordBatchReader returns Table not RBR, seems easily 
fixable. Also would be nice if ExecPlan had a limit node, or the SinkNode took 
a limit, or something.
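
To make the first two cases concrete, here is a minimal sketch of the 
workarounds described above. It is only an illustration under stated 
assumptions: it uses plain Python lists of dicts in place of Arrow data, it 
does not call Arrow or the ExecPlan at all, and the names {{rows}}, 
{{arrange_by_temp_expr}}, {{arrange_tail}} and {{_sort_key}} are hypothetical.

```python
import heapq

# Plain dicts stand in for rows of an Arrow Table; nothing here calls Arrow.
rows = [
    {"x": 3, "y": 1},
    {"x": 1, "y": 5},
    {"x": 2, "y": 2},
    {"x": 4, "y": 3},
]


def arrange_by_temp_expr(rows):
    """Mimic arrange(ds, x * y): project, sort, then drop the derived column."""
    # Step 1: project the temporary expression into a derived column.
    projected = [{**r, "_sort_key": r["x"] * r["y"]} for r in rows]
    # Step 2: sort on the derived column (the plan's final, sink-side step).
    ordered = sorted(projected, key=lambda r: r["_sort_key"])
    # Step 3: drop the derived column on the materialized result, outside the
    # plan, so the ordering is not lost by a second pass.
    return [{k: v for k, v in r.items() if k != "_sort_key"} for r in ordered]


def arrange_tail(rows, key, n):
    """Mimic arrange(key) %>% tail(n) using a reversed top-K plus a re-sort."""
    # Top-K on the reversed ordering picks the right rows, but largest first.
    top = heapq.nlargest(n, rows, key=lambda r: r[key])
    # Re-sort to restore the ascending order that tail() is expected to return.
    return sorted(top, key=lambda r: r[key])


by_product = arrange_by_temp_expr(rows)
last_two = arrange_tail(rows, "x", 2)
```

In both functions the last step runs on an already materialized result, which 
is the part that does not fit naturally into a single streaming plan.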

I'm not sure what the performance impact would be in these cases if we were to 
compute into a Table, do whatever finishing steps, and push back into a RBR, 
which in most cases is just going to be pulled back into a Table in R. But 
maybe these are sufficiently uncommon scenarios that we shouldn't let them 
shape our API to the extent that they are.

P.S. I haven't checked whether there are open JIRAs for all of those ExecPlan 
issues but there probably should be.

> [R] Refactor do_exec_plan to return a RecordBatchReader
> -------------------------------------------------------
>
>                 Key: ARROW-15271
>                 URL: https://issues.apache.org/jira/browse/ARROW-15271
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Will Jones
>            Priority: Major
>
> Right now 
> [{{do_exec_plan}}|https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L18]
>  returns an Arrow table because {{head}}, {{tail}}, and {{arrange}} do. If 
> ARROW-14289 is completed and similar work is done for {{arrange}}, we may be 
> able to alter {{do_exec_plan}} to return a RBR instead.
> The {{map_batches()}} implementation (ARROW-14029) could benefit from this 
> refactor. And it might make ARROW-15040 more useful.



