[ https://issues.apache.org/jira/browse/ARROW-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson resolved ARROW-17463.
-------------------------------------
    Resolution: Fixed

Issue resolved by pull request 13954
[https://github.com/apache/arrow/pull/13954]

> [R] Avoid unnecessary projections
> ---------------------------------
>
>                 Key: ARROW-17463
>                 URL: https://issues.apache.org/jira/browse/ARROW-17463
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In ExecPlan$Build(), we call Project in a few places, and there is code to
> make sure that there is at least one ProjectNode in the query in order to
> remove augmented fields from a Dataset scan (unless the user has added them).
> As a result, it is possible to get multiple ProjectNodes in a row that are
> essentially no-op. One example is with grouped aggregation: there is a
> projection to get the order of the columns back to what R expects, and then a
> no-op projection after that:
> {code}
> > mtcars |> arrow_table() |> count(cyl) |> explain()
> ExecPlan with 6 nodes:
> 5:SinkNode{}
> 4:ProjectNode{projection=[cyl, n]}
> 3:ProjectNode{projection=[cyl, n]}
> 2:GroupByNode{keys=["cyl"], aggregates=[
>     hash_sum(n, {skip_nulls=true, min_count=1}),
> ]}
> 1:ProjectNode{projection=["n": 1, cyl]}
> 0:TableSourceNode{}
> {code}
> IDK how significant of a performance impact this would have, but it certainly
> looks wasteful and should be avoidable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
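
The redundancy described in the issue (two identical ProjectNodes in a row, nodes 3 and 4 in the plan above) can be illustrated with a small toy model. This is only a sketch in Python, not the change made in pull request 13954 and not Arrow's API: the `Node` class, its `computed` flag, and `drop_repeated_projections` are hypothetical names invented for illustration. It assumes a projection is a no-op when it only passes columns through and emits exactly the same fields, in the same order, as the projection feeding it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one node of a query plan."""
    kind: str
    fields: tuple = ()       # output column names (used for ProjectNode only)
    computed: bool = False   # True if the projection evaluates a new expression


def drop_repeated_projections(plan):
    """Return a copy of `plan` (ordered source first) without projections
    that merely repeat the projection directly before them."""
    kept = []
    for node in plan:
        prev = kept[-1] if kept else None
        is_repeat = (
            prev is not None
            and node.kind == "ProjectNode"
            and prev.kind == "ProjectNode"
            and not node.computed
            and node.fields == prev.fields
        )
        if not is_repeat:
            kept.append(node)
    return kept


# The six-node plan from the issue, listed from source (0) to sink (5).
plan = [
    Node("TableSourceNode"),
    Node("ProjectNode", ("n", "cyl"), computed=True),  # adds the literal column n = 1
    Node("GroupByNode"),
    Node("ProjectNode", ("cyl", "n")),                 # restores the column order R expects
    Node("ProjectNode", ("cyl", "n")),                 # same fields again: a no-op
    Node("SinkNode"),
]

simplified = drop_repeated_projections(plan)
for index, node in enumerate(simplified):
    print(index, node.kind, list(node.fields))
```

In this model the second `[cyl, n]` projection is dropped and the other five nodes are kept. The first projection survives because it computes a new column rather than passing existing ones through.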