[ 
https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583752#comment-17583752
 ] 

Weston Pace commented on ARROW-14071:
-------------------------------------

Aggregation (summarise) operations are a little trickier. Some C++ work will 
probably be needed to formalize things, but it is not a huge lift from where 
we are.  In Acero, aggregate operations are stateful and are represented by 
three different functions: consume (process a batch and update the state), 
merge (combine two states into one), and finalize (turn a state into record 
batches).  So, for example, the min operation is:

consume - find the minimum value in a batch and, if smaller than state.min, 
update state.min
merge - compare the state.min of two states and keep the smaller of the two
finalize - turn state.min into an int64 scalar

When writing UDFs we would probably need to follow the same structure and 
provide three R functions for each aggregate operation.
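
As a rough illustration of that structure, here is a minimal sketch of min 
written as three plain R functions. All of the names (min_init, min_consume, 
min_merge, min_finalize, smaller) are hypothetical and are not part of any 
existing arrow API. The state is an ordinary R list, batches are plain numeric 
vectors, and the result is a double rather than an int64 scalar, since base R 
has no native int64 type.

```r
# Hypothetical sketch only: none of these functions exist in the arrow package.

# Initial state: NA means "no value seen yet".
min_init <- function() list(min = NA_real_)

# Smaller of two values, treating NA as "no value".
smaller <- function(a, b) {
  if (is.na(a)) return(b)
  if (is.na(b)) return(a)
  min(a, b)
}

# consume: fold one batch into the state, ignoring missing values.
min_consume <- function(state, batch) {
  values <- batch[!is.na(batch)]
  if (length(values) > 0) {
    state$min <- smaller(state$min, min(values))
  }
  state
}

# merge: combine two states into one.
min_merge <- function(left, right) {
  list(min = smaller(left$min, right$min))
}

# finalize: turn the state into the result value.
min_finalize <- function(state) state$min

# Two independent states consume their own batches and are then merged.
a <- min_consume(min_consume(min_init(), c(7, 3, 9)), c(5, NA))
b <- min_consume(min_init(), c(4, 8))
result <- min_finalize(min_merge(a, b))
print(result)
```

Keeping merge separate from consume is what would let several threads each 
build their own state and combine them at the end.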

> [R] Try to arrow_eval user-defined functions
> --------------------------------------------
>
>                 Key: ARROW-14071
>                 URL: https://issues.apache.org/jira/browse/ARROW-14071
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> *Proposed approach*:
> * Investigate simple injection with {{!!}}
> * Investigate injection with {{rlang::inject()}}
> * Investigate the data mask levels approach
> * Investigate other (undefined) approaches
> *Original description*:
> The first test passes but the second one fails, even though they're 
> equivalent. The user's function isn't being evaluated in the nse_funcs 
> environment.
> {code}
>   expect_dplyr_equal(
>     input %>%
>       select(-fct) %>%
>       filter(nchar(padded_strings) < 10) %>%
>       collect(),
>     tbl
>   )
>   isShortString <- function(x) nchar(x) < 10
>   expect_dplyr_equal(
>     input %>%
>       select(-fct) %>%
>       filter(isShortString(padded_strings)) %>%
>       collect(),
>     tbl
>   )
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
