[GitHub] [arrow] paleolimbot commented on a diff in pull request #13397: ARROW-16444: [R] Implement user-defined scalar functions in R bindings

GitBox Thu, 21 Jul 2022 07:37:24 -0700


paleolimbot commented on code in PR #13397:
URL: https://github.com/apache/arrow/pull/13397#discussion_r926757622



##########
r/R/compute.R:
##########
@@ -306,3 +306,145 @@ cast_options <- function(safe = TRUE, ...) {
   )
   modifyList(opts, list(...))
 }
+
+#' Register user-defined functions
+#'
+#' These functions support calling R code from query engine execution
+#' (i.e., a [dplyr::mutate()] or [dplyr::filter()] on a [Table] or [Dataset]).
+#' Use [register_scalar_function()] to attach Arrow input and output types to an
+#' R function and make it available for use in the dplyr interface and/or
+#' [call_function()]. Scalar functions are currently the only type of
+#' user-defined function supported. In Arrow, scalar functions must be
+#' stateless and return output with the same shape (i.e., the same number
+#' of rows) as the input.
+#'
+#' @param name The function name to be used in the dplyr bindings
+#' @param in_type A [DataType] of the input type or a [schema()]
+#'   for functions with more than one argument. This signature will be used
+#'   to determine if this function is appropriate for a given set of arguments.
+#'   If this function is appropriate for more than one signature, pass a
+#'   `list()` of the above.
+#' @param out_type A [DataType] of the output type or a function accepting
+#'   a single argument (`types`), which is a `list()` of [DataType]s. If a
+#'   function it must return a [DataType].
+#' @param fun An R function or rlang-style lambda expression. The function
+#'   will be called with a first argument `context` which is a `list()`

Review Comment:
   A previous version of this PR didn't require the `context` argument when 
the equivalent of `auto_convert` was `TRUE`, but the question "why two APIs?" 
was raised (and I agree...one wrapper function scheme is easier to remember).
   
   In its current form, the `context` argument provides the information needed 
for `auto_convert` to do its magic. When `auto_convert` is `TRUE`, you could 
also use it to do something like `runif(n = context$batch_size)`. The Python 
version also provides the memory pool here, but we don't provide a way to use 
the memory pool for constructing arrays, so I didn't add it to the context 
object.
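
   As a minimal sketch of that `batch_size` use (plain R, no arrow calls): the 
`add_noise` name and the hand-built `fake_context` list are assumptions for 
illustration only, and the real context object is constructed by the package.

```r
# Hypothetical scalar UDF body: adds uniform noise to each element.
# With `auto_convert = TRUE`, `x` is assumed to arrive as a plain R vector.
add_noise <- function(context, x) {
  x + runif(n = context$batch_size)
}

# Stand-in for the context the package would pass. Only `batch_size` is
# included because it is the only field mentioned in this comment.
fake_context <- list(batch_size = 3L)

set.seed(1)
result <- add_noise(fake_context, c(10, 20, 30))
length(result)
```

   The output has one value per input row, which is what a scalar function 
has to return.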
   
   Because it's a `list()`, assignments to it won't have any effect outside 
`fun`. A future version *may* use an environment instead, to avoid the extra 
unwind protects needed to allocate a new list for each call (possibly one with 
an overridden `[[<-` to prevent modification).
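
   The list-versus-environment difference can be illustrated in base R. This is 
only a sketch of R's copy and reference semantics, not the PR's code; `modify`, 
`ctx_list`, and `ctx_env` are made-up names.

```r
# A function that tries to overwrite a field of whatever it is given.
modify <- function(context) {
  context$batch_size <- 0L
  context$batch_size
}

# A list is copied on modification, so the change stays inside `modify()`.
ctx_list <- list(batch_size = 10L)
inside_list <- modify(ctx_list)

# An environment has reference semantics, so the change leaks to the caller.
ctx_env <- new.env()
ctx_env$batch_size <- 10L
inside_env <- modify(ctx_env)

ctx_list$batch_size  # unchanged by the call
ctx_env$batch_size   # overwritten by the call
```

   This is why an environment-based context would need some extra guard 
against modification, while the list-based one does not.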
   
   I added some text to the documentation for `fun` and disallowed lambdas for 
now, since a potential future workaround could be to not pass the `context` 
argument to an rlang/purrr-style lambda (e.g., `~.x + .y` would be the 
equivalent of `function(context, x, y) x + y`). I hesitate to add too much 
convenience functionality in this PR since it's already rather unwieldy.
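
   For illustration, one way such a lambda wrapper might look in base R. 
`lambda_to_udf` is a hypothetical helper, not something in this PR or in 
rlang, and it only handles the `.x`/`.y` pronouns.

```r
# Hypothetical helper: wrap a one-sided formula so that it accepts (and
# ignores) a leading `context` argument.
lambda_to_udf <- function(lambda) {
  stopifnot(inherits(lambda, "formula"), length(lambda) == 2L)
  body_expr <- lambda[[2]]          # the expression to the right of `~`
  enclos <- environment(lambda)     # where the formula was created
  function(context, .x, .y = NULL) {
    # Evaluate the lambda body with `.x` and `.y` bound to the arguments.
    eval(body_expr, list(.x = .x, .y = .y), enclos)
  }
}

add <- lambda_to_udf(~ .x + .y)
sums <- add(list(batch_size = 3L), 1:3, 4:6)
sums
```

   Here `add` behaves like `function(context, x, y) x + y`: the context is 
accepted but never reaches the lambda body.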
   
   I added a `length(formals(fun))` check...you're right that the error message 
was awful.
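
   A check along those lines might look like the sketch below. 
`check_udf_args` and the message wording are assumptions, not the code in the 
PR.

```r
# Hypothetical validator: fail early, with a readable message, when `fun`
# cannot accept the `context` argument at all.
check_udf_args <- function(fun) {
  if (length(formals(fun)) == 0L) {
    stop(
      "`fun` must accept at least one argument (the `context` list)",
      call. = FALSE
    )
  }
  invisible(fun)
}

# Passes: the function has a `context` argument plus one input.
ok <- check_udf_args(function(context, x) x)

# Fails with the message above instead of an obscure error at call time.
err <- tryCatch(
  check_udf_args(function() 1),
  error = function(e) conditionMessage(e)
)
err
```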



