Re: [I] [R] creating arrow supported expressions [arrow]

via GitHub Fri, 21 Feb 2025 10:17:51 -0800


JosiahParry commented on issue #45438:
URL: https://github.com/apache/arrow/issues/45438#issuecomment-2675242573


   Thanks everyone! The package I have been developing works on the native C 
FFI struct interface. I'm not sure I follow why DataFusion or another 
query engine is necessary here when the computations already happen on native 
arrow arrays. 
   
   The package already works with `nanoarrow` arrays and array streams by 
passing the arrow pointers to Rust and using the [C FFI 
interface](https://docs.rs/arrow/latest/arrow/ffi/index.html). My hope is to 
get it to work with the arrow R package's representation of RecordBatches, 
native arrays, etc.
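   
   A minimal sketch of one possible bridge, assuming the `nanoarrow` 
conversion methods for arrow R package objects behave as documented: arrow 
arrays and tables are wrapped as `nanoarrow` objects first, so the existing 
pointer-based code path could be reused. The toy data below is purely 
illustrative and is not taken from the package.
   
   ```r
   library(arrow)
   library(nanoarrow)
   
   # An arrow R package array, wrapped as a nanoarrow array (C data interface)
   arr <- arrow::Array$create(c(1.5, 2.5, NA))
   na_arr <- nanoarrow::as_nanoarrow_array(arr)
   
   # An arrow Table, exposed as a nanoarrow array stream (C stream interface)
   tbl <- arrow::arrow_table(x = 1:3, y = c("a", "b", "c"))
   stream <- nanoarrow::as_nanoarrow_array_stream(tbl)
   
   # Round-trip back to R vectors to check that nothing was lost
   back <- nanoarrow::convert_array(na_arr)
   ```
   
   Whether the extension type metadata survives this conversion for geoarrow 
columns is not checked here.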
   
   I think UDFs _are_ what would be useful. Though it's unclear to me if the 
current state of the package will be flexible enough. 
   
   I was able to get a single example working using 
`register_scalar_function()`.
   
   ```r
   library(dplyr)
   library(geoarrow)
   devtools::load_all()
   #> ℹ Loading geoarrowrs
   
   # read in the nc shapefile 
   nc_df <- sf::st_read(system.file("shape/nc.shp", package = "sf")) |> 
     sf::st_cast("POLYGON") |> 
     mutate(geometry = geoarrow::as_geoarrow_vctr(geometry)) |> 
     as_tibble() 
   
   # extract the schema from the input data
   schema <- as_geoarrow_schema(nc_df$geometry) |> 
       nanoarrow::as_nanoarrow_schema() |> 
       arrow::as_data_type()
   
   # FIXME this should be able to support POLYGON, MULTIPOLYGON, and any CRS in 
   # the schema
   # I'd like to be able to specify _just_ the extension type(s)
   arrow::register_scalar_function(
     name = "area_udf",
     function(context, x) {
       area_euclidean_unsigned_(x) |> 
         nanoarrow::as_nanoarrow_array_stream()
     },
     in_type = schema,
     out_type = arrow::float64(),
     auto_convert = TRUE
   )
   
   nc <- arrow::as_arrow_table(nc_df)
   
   transmute(nc, area = area_udf(geometry)) |> 
     collect()
   #> # A tibble: 108 × 1
   #>       area
   #>  *   <dbl>
   #>  1 0.114  
   #>  2 0.0614 
   #>  3 0.143  
   #>  4 0.0588 
   #>  5 0.00517
   #>  6 0.00577
   #>  7 0.153  
   #>  8 0.0972 
   #>  9 0.0619 
   #> 10 0.0908 
   #> # ℹ 98 more rows
   ```
   
   I think one of the challenges with this is that the schema isn't necessarily 
_fixed_ for each geometry extension type. For example, the CRS may be different. 
And this function works for `geometry.polygon` _or_ 
`geometry.multipolygon` extension types. This applies to both `in_type` 
and `out_type`.
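   
   A minimal sketch of one partial workaround, assuming the documented 
behaviour of `register_scalar_function()`: `in_type` can be a `list()` of 
types (one kernel per signature) and `out_type` can be a function of the 
input types. Plain numeric types stand in for the extension types here, and 
`twice_udf` is a hypothetical name, so this only illustrates the mechanism. 
Each distinct type (including each CRS) would still have to be listed up 
front.
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Hypothetical UDF; int32 and float64 stand in for extension types
   arrow::register_scalar_function(
     name = "twice_udf",
     function(context, x) x + x,
     # one kernel is registered per accepted input type
     in_type = list(arrow::int32(), arrow::float64()),
     # the output type is derived from the matched input type
     out_type = function(types) types[[1]],
     auto_convert = TRUE
   )
   
   tbl <- arrow::arrow_table(i = c(1L, 2L), d = c(1.5, 2.5))
   
   res <- tbl |> 
     transmute(i2 = twice_udf(i), d2 = twice_udf(d)) |> 
     collect()
   ```
   
   Whether kernel dispatch matches extension types with differing metadata 
the same way is an open question that this sketch does not answer.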
   
   Perhaps @paleolimbot may have more insight here as well. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
