[ 
https://issues.apache.org/jira/browse/ARROW-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-17463.
-------------------------------------
    Resolution: Fixed

Issue resolved by pull request 13954
[https://github.com/apache/arrow/pull/13954]

> [R] Avoid unnecessary projections
> ---------------------------------
>
>                 Key: ARROW-17463
>                 URL: https://issues.apache.org/jira/browse/ARROW-17463
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In ExecPlan$Build(), we call Project in a few places, and there is code to 
> make sure that there is at least one ProjectNode in the query in order to 
> remove augmented fields from a Dataset scan (unless the user has added them). 
> As a result, it is possible to get several consecutive ProjectNodes that are 
> essentially no-ops. One example is grouped aggregation: there is a 
> projection to put the columns back in the order R expects, and then a 
> no-op projection after that:
> {code}
> > mtcars |> arrow_table() |> count(cyl) |> explain()
> ExecPlan with 6 nodes:
> 5:SinkNode{}
>   4:ProjectNode{projection=[cyl, n]}
>     3:ProjectNode{projection=[cyl, n]}
>       2:GroupByNode{keys=["cyl"], aggregates=[
>               hash_sum(n, {skip_nulls=true, min_count=1}),
>       ]}
>         1:ProjectNode{projection=["n": 1, cyl]}
>           0:TableSourceNode{}
> {code}
> I don't know how significant the performance impact is, but it certainly 
> looks wasteful and should be avoidable.
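
The redundancy described above can be illustrated with a small toy model. This is only a sketch, not the change made in pull request 13954 and not the Arrow API: the `Node` class, its fields, and `drop_noop_projections` are hypothetical names, and the GroupByNode output order (aggregates first, then keys) is an assumption inferred from the reordering projection in the plan. A ProjectNode is treated as a no-op when every expression is a bare reference to the upstream column of the same name, in the same position.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one plan node; not the Arrow API."""
    kind: str                                  # e.g. "ProjectNode", "GroupByNode"
    names: Optional[Tuple[str, ...]] = None    # output column names, if modeled
    exprs: Optional[Tuple[str, ...]] = None    # ProjectNode expressions only


def is_noop_projection(node: Node, upstream: Optional[Node]) -> bool:
    """True if `node` passes the upstream columns through unchanged, in order."""
    if node.kind != "ProjectNode" or upstream is None or upstream.names is None:
        return False
    # Each expression must be a bare reference to the same-position upstream
    # column, and the output names must be unchanged as well.
    return node.exprs == upstream.names and node.names == upstream.names


def drop_noop_projections(plan: List[Node]) -> List[Node]:
    """Return `plan` (ordered source to sink) without no-op ProjectNodes."""
    result: List[Node] = []
    for node in plan:
        upstream = result[-1] if result else None
        if is_noop_projection(node, upstream):
            continue  # nothing is renamed, reordered, or computed
        result.append(node)
    return result


# Toy version of the plan printed by explain() above, source first.
plan = [
    Node("TableSourceNode"),
    Node("ProjectNode", names=("n", "cyl"), exprs=("1", "cyl")),
    Node("GroupByNode", names=("n", "cyl")),   # assumed order: aggregates, keys
    Node("ProjectNode", names=("cyl", "n"), exprs=("cyl", "n")),  # reorders
    Node("ProjectNode", names=("cyl", "n"), exprs=("cyl", "n")),  # no-op
    Node("SinkNode"),
]

simplified = drop_noop_projections(plan)
print(len(plan), "->", len(simplified))  # 6 -> 5
```

Comparing expressions against the upstream output names, rather than against the previous projection's expressions, matters: two identical projections that both compute something (for example `a + 1` stored as `a`) are not redundant, whereas a projection of bare references to the columns already present is.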



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
