JosiahParry commented on issue #45438: URL: https://github.com/apache/arrow/issues/45438#issuecomment-2675242573
Thanks everyone! The package I have been developing works on the native C FFI struct interface. I'm not sure I follow why DataFusion or another query engine is necessary here when the computations already happen on native arrow arrays.

The package already works with `nanoarrow` arrays and array streams by passing the arrow pointers to Rust and using the [C FFI interface](https://docs.rs/arrow/latest/arrow/ffi/index.html). My hope is to get it to work with the arrow R package representation of RecordBatches, native arrays, etc.

I think UDFs _are_ what would be useful, though it's unclear to me whether the current state of the package will be flexible enough. I was able to get a single example working using `register_scalar_function()`.

```r
library(dplyr)
library(geoarrow)
devtools::load_all()
#> ℹ Loading geoarrowrs

# read in the nc shapefile
nc_df <- sf::st_read(system.file("shape/nc.shp", package = "sf")) |>
  sf::st_cast("POLYGON") |>
  mutate(geometry = geoarrow::as_geoarrow_vctr(geometry)) |>
  as_tibble()

# extract the schema from the input data
schema <- as_geoarrow_schema(nc_df$geometry) |>
  nanoarrow::as_nanoarrow_schema() |>
  arrow::as_data_type()

# FIXME this should be able to support POLYGON, MULTIPOLYGON, and any CRS in the schema
# I'd like to be able to specify _just_ the extension type(s)
arrow::register_scalar_function(
  name = "area_udf",
  function(context, x) {
    area_euclidean_unsigned_(x) |>
      nanoarrow::as_nanoarrow_array_stream()
  },
  in_type = schema,
  out_type = arrow::float64(),
  auto_convert = TRUE
)

nc <- arrow::as_arrow_table(nc_df)

transmute(nc, area = area_udf(geometry)) |>
  collect()
#> # A tibble: 108 × 1
#>       area
#>  *   <dbl>
#>  1 0.114
#>  2 0.0614
#>  3 0.143
#>  4 0.0588
#>  5 0.00517
#>  6 0.00577
#>  7 0.153
#>  8 0.0972
#>  9 0.0619
#> 10 0.0908
#> # ℹ 98 more rows
```

I think one of the challenges with this is that the schema isn't necessarily _fixed_ for each geometry extension type. For example, the CRS may be different.
Similarly, this function works for `geometry.polygon` _or_ `geometry.multipolygon` extension types. This applies to both `in_type` and `out_type`.

Perhaps @paleolimbot may have more insight here as well.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org