JosiahParry commented on issue #45438: URL: https://github.com/apache/arrow/issues/45438#issuecomment-2675789051
> I think you should be able to register the same function with different input/output signatures.

This appears to be true! Though it seems the registration is _very_ strict. The [geoarrow extension metadata](https://geoarrow.org/extension-types.html#extension-metadata) contains PROJJSON, for which we cannot enumerate every possible value. AFAICT, `{arrow}` requires that the schema match identically. Is there any way to introduce flexibility? (One possible workaround I can think of is sketched at the end of this comment.)

``` r
library(dplyr)
library(geoarrow)

# read in the nc shapefile
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf")) |>
  sf::st_cast("POLYGON") |>
  mutate(geometry = geoarrow::as_geoarrow_vctr(geometry)) |>
  as_tibble() |>
  arrow::as_arrow_table()

# using default extension schema
schema <- na_extension_geoarrow("POLYGON") |>
  nanoarrow::as_nanoarrow_schema() |>
  arrow::as_data_type()

# register function using geoarrow polygon extension schema
arrow::register_scalar_function(
  name = "area_udf",
  function(context, x) {
    geoarrowrs::area_euclidean_unsigned_(x) |>
      nanoarrow::as_nanoarrow_array_stream()
  },
  in_type = schema,
  out_type = arrow::float64(),
  auto_convert = TRUE
)

transmute(nc, area = area_udf(geometry))
#> Error in `map()`:
#> ℹ In index: 1.
#> ℹ With name: area.
#> Caused by error:
#> ! NotImplemented: Function 'area_udf' has no kernel matching input types (geoarrow.polygon <CRS: {
#>   "$schema": "https://pro...)
```

> I am not quite sure where the arrow package fits in with what you're trying to do.

My hope is that we (the R community) can start working with native Arrow objects much more. My goal is to let users apply the geoarrow-rust functionality to an Arrow table without having to convert back and forth between Arrow tables and data.frames, which defeats the whole point of arrow (imo).

> I wonder if you can "just" have R functions ...

Yes, that is the hope! But at present, for a function to work with dplyr on an Arrow table it _must_ be registered as above, because otherwise, regardless of whether the function accepts and returns nanoarrow/arrow objects, the entire table is collected into a data.frame first 👇🏽

``` r
library(dplyr)
library(geoarrow)

# read in the nc shapefile
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf")) |>
  sf::st_cast("POLYGON") |>
  mutate(geometry = as_geoarrow_vctr(geometry)) |>
  as_tibble() |>
  arrow::as_arrow_table()

nc |>
  transmute(
    area = geoarrowrs:::area_euclidean_unsigned_(geometry)
  )
#> ℹ Expression not supported in Arrow
#> → Pulling data into R
#> # A tibble: 108 × 1
#>          area
#>    <nnrrw_vc>
#>  1 0.114283505
#>  2 0.061399756
#>  3 0.143016284
#>  4 0.058829023
#>  5 0.005169872
#>  6 0.005772081
#>  7 0.152759301
#>  8 0.097157559
#>  9 0.061880449
#> 10 0.090801907
#> # ℹ 98 more rows
```

I am also hoping that this work lets geometry _just_ be another column in a dataset, without super fancy objects like sf that perform a bunch of magic behind the scenes. The goal is for it to be ["boringly interoperable"](https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going-native/).
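
As a minimal sketch of the workaround I have in mind, assuming the kernel lookup really is on exact type (and extension metadata) equality: derive `in_type` from the table's own geometry column instead of from the default `na_extension_geoarrow("POLYGON")` schema, so the PROJJSON CRS carried by the data is reproduced verbatim. `area_udf2` is just a placeholder name, `nc` is the table built in the first chunk above, and I have not verified this path end to end.

``` r
library(dplyr)

# `nc` is the Arrow Table built in the first chunk above.
# Reuse the exact extension type already attached to the column, so the
# registered signature carries the same PROJJSON CRS metadata.
geom_type <- nc$schema$GetFieldByName("geometry")$type

arrow::register_scalar_function(
  name = "area_udf2", # placeholder name for this sketch
  function(context, x) {
    geoarrowrs::area_euclidean_unsigned_(x) |>
      nanoarrow::as_nanoarrow_array_stream()
  },
  in_type = geom_type,
  out_type = arrow::float64(),
  auto_convert = TRUE
)

# If matching is on exact type equality, this should now find a kernel
transmute(nc, area = area_udf2(geometry))
```

Of course this only side-steps the strictness for one known CRS at a time; registering a signature per CRS does not scale, which is why some way to match on the extension name alone would be so useful.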
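
And to connect that back to the fallback shown in the second chunk: if the registration above does match, the same expression should be able to stay in Arrow rather than pulling the table into R. Again, just a sketch using the hypothetical `area_udf2`; the `call_function()` line is only there to exercise the kernel lookup directly, outside of dplyr.

``` r
# Call the registered kernel directly on the chunked array (no dplyr involved)
arrow::call_function("area_udf2", nc$geometry)

# Through dplyr: with a matching kernel this should no longer print
# "Pulling data into R"
nc |>
  dplyr::transmute(area = area_udf2(geometry))
```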