[
https://issues.apache.org/jira/browse/ARROW-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-17463:
-----------------------------------
Labels: pull-request-available (was: )
> [R] Avoid unnecessary projections
> ---------------------------------
>
> Key: ARROW-17463
> URL: https://issues.apache.org/jira/browse/ARROW-17463
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
> Priority: Major
> Labels: pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In ExecPlan$Build(), we call Project in a few places, and there is code to
> ensure that the query contains at least one ProjectNode, so that augmented
> fields from a Dataset scan are removed (unless the user has added them).
> As a result, a query can end up with multiple consecutive ProjectNodes that
> are essentially no-ops. One example is grouped aggregation: one projection
> restores the column order that R expects, and a no-op projection follows
> it:
> {code}
> > mtcars |> arrow_table() |> count(cyl) |> explain()
> ExecPlan with 6 nodes:
> 5:SinkNode{}
> 4:ProjectNode{projection=[cyl, n]}
> 3:ProjectNode{projection=[cyl, n]}
> 2:GroupByNode{keys=["cyl"], aggregates=[
> hash_sum(n, {skip_nulls=true, min_count=1}),
> ]}
> 1:ProjectNode{projection=["n": 1, cyl]}
> 0:TableSourceNode{}
> {code}
> I don't know how significant the performance impact is, but it certainly
> looks wasteful and should be avoidable.
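
One way to picture the cleanup described above is a pass that tracks the column order flowing through the plan and skips any projection that would leave it unchanged. The following is only a minimal sketch in base R under stated assumptions: the list-of-nodes plan representation, the `fields` and `computed` entries, and the function name `drop_noop_projections` are all hypothetical and are not part of arrow's internals or of ExecPlan$Build(). It also assumes the aggregation emits its columns as (n, cyl), which is why the first reordering projection is kept.

```r
# Hypothetical plan representation (not arrow's API): a list of nodes in
# execution order. Each node has a `type`; nodes that change the schema carry
# `fields` (output column names); projections also carry `computed`, which is
# TRUE when any output is an expression rather than a plain column reference.
drop_noop_projections <- function(nodes, input_schema) {
  schema <- input_schema
  kept <- list()
  for (node in nodes) {
    if (identical(node$type, "project")) {
      # A projection is a no-op when it only references existing columns
      # and emits them in exactly the order they already have.
      is_noop <- !isTRUE(node$computed) && identical(node$fields, schema)
      if (is_noop) next
      schema <- node$fields
    } else if (!is.null(node$fields)) {
      schema <- node$fields
    }
    kept[[length(kept) + 1]] <- node
  }
  kept
}

# The count(cyl) plan from the issue, written bottom-up (source to sink).
plan <- list(
  list(type = "project", fields = c("n", "cyl"), computed = TRUE),
  list(type = "group_by", fields = c("n", "cyl")),  # assumed output order
  list(type = "project", fields = c("cyl", "n"), computed = FALSE),
  list(type = "project", fields = c("cyl", "n"), computed = FALSE)
)

result <- drop_noop_projections(plan, input_schema = c("mpg", "cyl", "disp"))
print(vapply(result, function(node) node$type, character(1)))
```

In this sketch the reordering projection survives because it changes the column order, while the projection after it is dropped because its output matches its input exactly. A real fix would also need to keep whichever projection is responsible for removing augmented fields from a Dataset scan.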