JosiahParry commented on issue #45438:
URL: https://github.com/apache/arrow/issues/45438#issuecomment-2675789051

   > I think you should be able to register the same function with different 
input/output signatures.
   
   This appears to be true! However, the registration seems _very_ strict. The 
   [geoarrow extension 
   metadata](https://geoarrow.org/extension-types.html#extension-metadata) 
   contains PROJJSON, so we cannot enumerate every possible value ahead of time. 
   AFAICT, `{arrow}` requires that the input schema match the registered schema 
   exactly, extension metadata included. 
   
   Is there any way to introduce flexibility? 
   
   ``` r
   library(dplyr)
   library(geoarrow)
   
   # read in the nc shapefile 
   nc <- sf::st_read(system.file("shape/nc.shp", package = "sf")) |> 
     sf::st_cast("POLYGON") |> 
     mutate(geometry = geoarrow::as_geoarrow_vctr(geometry)) |> 
     as_tibble() |> 
     arrow::as_arrow_table()
   
   # using default extension schema
   schema <- na_extension_geoarrow("POLYGON") |> 
     nanoarrow::as_nanoarrow_schema() |> 
     arrow::as_data_type()
   
   # register function using geoarrow polygon extension schema
   arrow::register_scalar_function(
     name = "area_udf",
     function(context, x) {
       geoarrowrs::area_euclidean_unsigned_(x) |>
         nanoarrow::as_nanoarrow_array_stream()
     },
     in_type = schema,
     out_type = arrow::float64(),
     auto_convert = TRUE
   )
   
   transmute(nc, area = area_udf(geometry))
   #> Error in `map()`:
   #> ℹ In index: 1.
   #> ℹ With name: area.
   #> Caused by error:
   #> ! NotImplemented: Function 'area_udf' has no kernel matching input types 
(geoarrow.polygon <CRS: {
   #>   "$schema": "https://pro...)
   ```
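   
   One possible workaround (an untested sketch on my end; `area_udf2` is just a 
   placeholder name): derive `in_type` from the table's own geometry column so 
   the registered signature carries exactly the same extension metadata, 
   including the embedded PROJJSON CRS. It doesn't add real flexibility, but it 
   should at least let the kernel match for this specific table:
   
   ``` r
   # sketch: take the input type from the column itself so the CRS metadata
   # in the registered schema matches the data byte-for-byte
   geom_type <- nc$geometry$type
   
   arrow::register_scalar_function(
     name = "area_udf2",
     function(context, x) {
       geoarrowrs::area_euclidean_unsigned_(x) |>
         nanoarrow::as_nanoarrow_array_stream()
     },
     in_type = geom_type,
     out_type = arrow::float64(),
     auto_convert = TRUE
   )
   
   transmute(nc, area = area_udf2(geometry))
   ```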
   
   > I am not quite sure where the arrow package fits in with what you're 
trying to do. 
   
   My hope is that we (the R community) can start working with native arrow 
   objects much more. My goal is to let users apply geoarrow-rust functionality 
   to an arrow table without converting to and from data.frames, which (imo) 
   defeats the whole point of arrow.
   
   > I wonder if you can "just" have R functions ...
   
   Yes, that is the hope! But at present, for a function to work with dplyr on 
   an arrow table it _must_ be registered as above. Otherwise, regardless of 
   whether the function accepts and returns nanoarrow/arrow objects, the entire 
   table is collected into a data.frame first
   
   👇🏽
   
   ``` r
   library(dplyr)
   library(geoarrow)
   
   # read in the nc shapefile 
   nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf")) |> 
     sf::st_cast("POLYGON") |> 
     mutate(geometry = as_geoarrow_vctr(geometry)) |> 
     as_tibble() |> 
     arrow::as_arrow_table()
   
   nc |>
     transmute(
       area = geoarrowrs:::area_euclidean_unsigned_(geometry)
     )
   #> ℹ Expression not supported in Arrow
   #> → Pulling data into R
   #> # A tibble: 108 × 1
   #>    area       
   #>    <nnrrw_vc> 
   #>  1 0.114283505
   #>  2 0.061399756
   #>  3 0.143016284
   #>  4 0.058829023
   #>  5 0.005169872
   #>  6 0.005772081
   #>  7 0.152759301
   #>  8 0.097157559
   #>  9 0.061880449
   #> 10 0.090801907
   #> # ℹ 98 more rows
   ```
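   
   For contrast, here is a minimal sketch of what registration buys you 
   (assumed behavior; `times_two` is a made-up example over a plain float64 
   column, so the strict extension-metadata match is not an issue). Once 
   registered, the UDF participates in the arrow query instead of triggering a 
   full pull into R:
   
   ``` r
   # sketch: register a trivial UDF over plain float64; no extension metadata
   # is involved, so the kernel lookup has nothing strict to trip on
   arrow::register_scalar_function(
     name = "times_two",
     function(context, x) x * 2,
     in_type = arrow::float64(),
     out_type = arrow::float64(),
     auto_convert = TRUE
   )
   
   nc |>
     transmute(double_area = times_two(AREA)) |>
     dplyr::collect()
   ```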
   
   I am also hoping that this work lets geometry be _just_ another column in a 
   dataset, without super fancy objects like sf that perform a bunch of magic 
   behind the scenes. The goal is for it to be ["boringly 
   interoperable"](https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going-native/).
   

