paleolimbot commented on code in PR #13397: URL: https://github.com/apache/arrow/pull/13397#discussion_r926757622
########## r/R/compute.R: ########## @@ -306,3 +306,145 @@ cast_options <- function(safe = TRUE, ...) { ) modifyList(opts, list(...)) } + +#' Register user-defined functions +#' +#' These functions support calling R code from query engine execution +#' (i.e., a [dplyr::mutate()] or [dplyr::filter()] on a [Table] or [Dataset]). +#' Use [register_scalar_function()] attach Arrow input and output types to an +#' R function and make it available for use in the dplyr interface and/or +#' [call_function()]. Scalar functions are currently the only type of +#' user-defined function supported. In Arrow, scalar functions must be +#' stateless and return output with the same shape (i.e., the same number +#' of rows) as the input. +#' +#' @param name The function name to be used in the dplyr bindings +#' @param in_type A [DataType] of the input type or a [schema()] +#' for functions with more than one argument. This signature will be used +#' to determine if this function is appropriate for a given set of arguments. +#' If this function is appropriate for more than one signature, pass a +#' `list()` of the above. +#' @param out_type A [DataType] of the output type or a function accepting +#' a single argument (`types`), which is a `list()` of [DataType]s. If a +#' function it must return a [DataType]. +#' @param fun An R function or rlang-style lambda expression. The function +#' will be called with a first argument `context` which is a `list()` Review Comment: A previous version of this PR didn't require the `context` argument when what was the equivalent of `auto_convert` was `TRUE`, but the comment was raised of "why two APIs" (and I agree...one wrapper function scheme is easier to remember). In its current form, the `context` argument provides the information needed for `auto_convert` to do its magic. When `auto_convert` is `TRUE`, you could also use it to do something like `runif(n = context$batch_size)`. The python version also provides the memory pool here but we don't provide a way to use the memory pool for constructing arrays, so I didn't add it to the context object. Because it's a `list()`, assignments won't have any effect outside `fun`. A future version *may* be an environment to avoid the extra unwind protects needed to allocate a new list for each call (but could be one with an overridden `[[<-` to prevent modification). I added some text to the documentation for `fun` and disallowed lambdas for now, since a potential future workaround could be to not pass the `context` argument for an rlang/purrr style lambda (e.g., `~.x + .y` would be the equivalent of `function(context, x, y) x + y`). I hesitate to add too much convenience functionality in this PR since it's already rather unwieldy. I added a `length(formals(fun))` check...you're right that the error message was awful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org