[
https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583752#comment-17583752
]
Weston Pace commented on ARROW-14071:
-------------------------------------
Aggregation (summarise) operations are a little trickier, and some C++ work will
probably be needed to formalize things a bit, but it is not a huge lift from
where we are. In Acero, aggregate operations are stateful and are actually
represented by three different functions: consume (process a batch and update
state), merge (combine two states into one), and finalize (turn a state into
record batches). So, for example, the min operation is:
* consume - find the minimum value in a batch and, if it is smaller than
  state.min, update state.min
* merge - compare the two state.min values and keep the smaller of the two
* finalize - turn state.min into an int64 scalar
When writing UDFs we would probably need to follow the same structure and
provide three R functions for each aggregate operation.
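
To make the consume/merge/finalize structure concrete, here is a rough sketch
of what the three R functions for min might look like. This is only an
illustration: the names min_init, min_consume, min_merge and min_finalize are
hypothetical, none of them is an existing arrow or Acero API, and plain R
vectors stand in for record batches and scalars.

```r
# Hypothetical sketch only: these functions are not part of the arrow package.
# A "batch" is a plain numeric vector and the "state" is a list with one field.

# Initial state: Inf means "no value seen yet".
min_init <- function() list(min = Inf)

# consume: fold one batch into the state.
min_consume <- function(state, batch) {
  # Including state$min among the arguments means an empty or all-NA batch
  # simply leaves the state unchanged.
  state$min <- min(state$min, batch, na.rm = TRUE)
  state
}

# merge: combine two states (e.g. from two threads) into one.
min_merge <- function(a, b) list(min = min(a$min, b$min))

# finalize: turn the state into the result (a length-one vector here,
# standing in for an Arrow scalar). No input at all yields NA.
min_finalize <- function(state) {
  if (is.infinite(state$min)) NA_real_ else state$min
}

# Usage: two independent partial aggregations, merged and finalized.
batches <- list(c(5, 3, 9), c(7, 2), c(8))
s1 <- Reduce(min_consume, batches[1:2], min_init())
s2 <- min_consume(min_init(), batches[[3]])
result <- min_finalize(min_merge(s1, s2))
```

Because merge is associative and commutative for min, the partial states can
be combined in any order, which is what would let batches be consumed in
parallel.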
> [R] Try to arrow_eval user-defined functions
> --------------------------------------------
>
> Key: ARROW-14071
> URL: https://issues.apache.org/jira/browse/ARROW-14071
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Assignee: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> *Proposed approach*:
> * Investigate simple injection with {{!!}}
> * Investigate injection with {{rlang::inject()}}
> * Investigate the data mask levels approach
> * Investigate other (undefined) approaches
> *Original description*:
> The first test passes but the second one fails, even though they're
> equivalent. The user's function isn't being evaluated in the nse_funcs
> environment.
> {code}
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(nchar(padded_strings) < 10) %>%
>     collect(),
>   tbl
> )
> isShortString <- function(x) nchar(x) < 10
> expect_dplyr_equal(
>   input %>%
>     select(-fct) %>%
>     filter(isShortString(padded_strings)) %>%
>     collect(),
>   tbl
> )
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)