paleolimbot commented on PR #13397:
URL: https://github.com/apache/arrow/pull/13397#issuecomment-1186071656
Thank you everybody for reviews! There are a lot of threads here so I'm
going to try to summarise which bits were unresolved and how I think I resolved
them (or maybe not). The CI failures on r-devel are because of a recent commit
to r-devel; the CI failure on Windows is a lintr error that I don't know how to
fix (cyclomatic complexity of an R6 object?).
- @pitrou - The CI failures I thought I saw as a result of your suggestions
were actually something else, so all those improvements to safe-call-into-r.h
made it here. Thank you!
- @pitrou, @nealrichardson, @jorisvandenbossche I reverted the interface to
match the Python interface more closely. While `register_scalar_function()` is
perhaps not as intuitive as `register_user_defined_function()`, it sounds like
there was some discussion about its name and I think it's best to defer to the
result of that discussion (we can revisit the syntax if it does indeed prove to
be confusing to R users). This included nixing the "advanced" API...all R
functions are passed by the user as `function(context, x, y, ...) { ... }`. I
did add an `auto_convert` argument: if `auto_convert = TRUE`, the arguments and
output get converted so that the function only has to deal with R objects
(which makes it substantially easier to write the function).
- @pitrou I didn't add an R API to specify documentation. If there becomes a
way that compute function documentation is accessible and useful in R, I think
that is the right time to add that API to user-defined functions.
- @nealrichardson I changed the user-facing example to the one you suggested
(predict method of a linear model). For the tests I kept something simpler
since there's a few variations that are needed to test all the cases.
- @nealrichardson expressed concern that because we can't use user-defined
functions when we return an R-level RecordBatchReader that it may be confusing
when this fails. I think this only affects `to_duckb()` which is relatively
new, and @westonpace indicated that this should be possible in the future
(perhaps some details about that might help with this concern?).
A concern I have that never came up is that there is probably more overhead
than is needed for each call into R. We can probably get it down to a single
unwind protect...I think that would be best handled in a follow-up PR with a
narrow scope but I could in theory try here too.
Example usage with the final API:
<details>
``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()`
for more information.
library(dplyr, warn.conflicts = FALSE)
some_model <- lm(mpg ~ disp + cyl, data = mtcars)
register_scalar_function(
"mtcars_predict_mpg",
function(context, disp, cyl) {
predict(some_model, newdata = data.frame(disp, cyl))
},
in_type = schema(disp = float64(), cyl = float64()),
out_type = float64(),
auto_convert = TRUE
)
as_arrow_table(mtcars) %>%
transmute(mpg, mpg_predicted = mtcars_predict_mpg(disp, cyl)) %>%
collect() %>%
head()
#> mpg mpg_predicted
#> 1 21.0 21.84395
#> 2 21.0 21.84395
#> 3 22.8 26.08886
#> 4 21.4 19.82676
#> 5 18.7 14.55267
#> 6 18.1 20.50602
```
<sup>Created on 2022-07-15 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)</sup>
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]