paleolimbot commented on PR #13397:
URL: https://github.com/apache/arrow/pull/13397#issuecomment-1177991132
I *think* I've incorporated all the comments here - I've summarised the
unresolved bits below but feel free to add to that list.
I agree that the "the whole entire plan must be completely evaluated in one
call into C++ from R" constraint is not ideal and I'm not offended if we want
to bump this to the next release to see if we can do it better. It's a new
feature and I think it's OK that we include it and let users give feedback on
ways that user-defined functions can be improved (which may include support for
the R-level record batch reader).
I included improvements to `SafeCallIntoR<>()` / `RunWithCapturedR()` in
this PR because it felt like the bad error messages and code complexity of
using them were becoming particularly evident. I'm happy to remove those
changes and put them in another PR, too, since they widen the scope of this PR
beyond just UDFs.
Here is a motivating example from the geospatial end of things that might be
more fun to play with. It does highlight some of the complexities of matching
extension types, which is not all that well supported yet.
<details>
``` r
# remotes::install_github("apache/arrow#13397")
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
# (need a better generator for this in geoarrow)
geoarrow_wkb_type_arrow <- arrow:::DataType$import_from_c(
  narrow::as_narrow_schema(geoarrow_wkb())
)
# scalar function wrapper
st_perimeter_wrapper <- arrow_scalar_function(
  function(x) {
    sf::st_length(sf::st_boundary(sf::st_as_sfc(x)))
  },
  in_type = schema(x = geoarrow_wkb_type_arrow),
  out_type = float64()
)
# register!
register_user_defined_function(st_perimeter_wrapper, "st_perimeter")
# some example data
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
# parameterized extension types (e.g., with crs) don't match the kernel signature
sf::st_crs(nc) <- NA_crs_
nc_table <- as_geoarrow_table(nc, schema = geoarrow_schema_wkb())
# use in a pipeline
nc_table |>
  transmute(NAME, len = st_perimeter(geometry)) |>
  collect()
#> # A tibble: 100 × 2
#>    NAME          len
#>    <chr>       <dbl>
#>  1 Ashe         1.44
#>  2 Alleghany    1.23
#>  3 Surry        1.63
#>  4 Currituck    2.97
#>  5 Northampton  2.21
#>  6 Hertford     1.67
#>  7 Camden       1.55
#>  8 Gates        1.28
#>  9 Warren       1.42
#> 10 Stokes       1.43
#> # … with 90 more rows
# check answers
nc |>
  transmute(NAME, len = sf::st_length(sf::st_boundary(geometry)))
#> Simple feature collection with 100 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> CRS: NA
#> # A tibble: 100 × 3
#>    NAME          len                                                    geometry
#>  * <chr>       <dbl>                                              <MULTIPOLYGON>
#>  1 Ashe         1.44 (((-81.47276 36.23436, -81.54084 36.27251, -81.56198 36.27…
#>  2 Alleghany    1.23 (((-81.23989 36.36536, -81.24069 36.37942, -81.26284 36.40…
#>  3 Surry        1.63 (((-80.45634 36.24256, -80.47639 36.25473, -80.53688 36.25…
#>  4 Currituck    2.97 (((-76.00897 36.3196, -76.01735 36.33773, -76.03288 36.335…
#>  5 Northampton  2.21 (((-77.21767 36.24098, -77.23461 36.2146, -77.29861 36.211…
#>  6 Hertford     1.67 (((-76.74506 36.23392, -76.98069 36.23024, -76.99475 36.23…
#>  7 Camden       1.55 (((-76.00897 36.3196, -75.95718 36.19377, -75.98134 36.169…
#>  8 Gates        1.28 (((-76.56251 36.34057, -76.60424 36.31498, -76.64822 36.31…
#>  9 Warren       1.42 (((-78.30876 36.26004, -78.28293 36.29188, -78.32125 36.54…
#> 10 Stokes       1.43 (((-80.02567 36.25023, -80.45301 36.25709, -80.43531 36.55…
#> # … with 90 more rows
```
<sup>Created on 2022-07-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
</details>