paleolimbot commented on PR #12817:
URL: https://github.com/apache/arrow/pull/12817#issuecomment-1099521412

   Ok! This is ready for a review. The main motivation here is to allow other packages to define conversions to Arrow objects. This is most useful for Table objects, since we currently convert to Table before `write_parquet/ipc_stream/feather/csv_arrow()`, and for Array objects, because the R vector -> Array conversion is at the heart of the default conversion to Table. A motivating project for these methods is 'geoarrow', which implements an ExtensionType to store geometry columns (e.g., `sf::st_sfc()`) and table-level metadata for table-level objects (e.g., `sf::st_sf()`).
   
   Other methods I added here are important for lower-level packages like the work-in-progress 'substrait' and 'narrow', which have analogues for `DataType` (e.g., the `narrow_schema()` or `substrait.Type`), `Schema` (e.g., the `narrow_schema()` or `substrait.NamedStruct`), and `RecordBatchReader` (e.g., the `narrow_array_stream()`). In particular, I think that `as_record_batch_reader()` will be very useful, since it is more flexible than `Table` for things like database results that can be streamed into the writers.
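
   For illustration, a lower-level package wrapping something streamable could plug into that generic with a method of its own. This is only a minimal sketch: `db_result` and `fetch_all()` are hypothetical stand-ins for whatever class and accessor such a package would actually have, it assumes a Table can be handed back through `as_record_batch_reader()`, and a real method would stream batches instead of collecting everything up front.

   ``` r
   # hypothetical: 'db_result' is a class from some database package and
   # fetch_all() materializes its remaining rows as a data frame
   as_record_batch_reader.db_result <- function(x, ...) {
     # simplest possible version: collect, build a Table, hand back a reader over it
     as_record_batch_reader(Table$create(fetch_all(x)))
   }
   ```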
   
   Adding the S3 methods was reasonably straightforward; however, actually using them was quite complicated because the R vector -> Array conversion is highly optimized and mostly done in C++. The approach I took was to keep the existing code, changing as little as possible, for R objects that we handle internally; for other objects, the C++ code calls `type()` and `as_arrow_array()`. As is apparent from the number of files changed by this PR, that "simple" approach led to a whack-a-mole of test failures that I think I was able to solve with a minimal footprint; however, another option is to close this PR and tackle the changes in reverse order with smaller PRs.
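
   To illustrate that fallback path with something simpler than the extension-type example below, a hypothetical 'percent' class stored as a plain double vector would only need to declare its Arrow type and how to build the Array (a sketch, using the generics as they are spelled in this PR):

   ``` r
   # hypothetical 'percent' class: a bare double vector with a custom class attribute
   type.percent <- function(x, ...) {
     float64()
   }

   as_arrow_array.percent <- function(x, ...) {
     # drop the class and let the default double -> float64 conversion do the work
     Array$create(unclass(x))
   }
   ```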
   
   In a nutshell, users were previously stuck with the built-in behaviour for objects with a custom class. This was usually OK, but occasionally sub-optimal, non-functional, or corrupting. I've picked a probably rare example because record-style vectors are new, but the current set of conversions assumes that they're list-like (as opposed to data-frame-like):
   
   ``` r
   # install.packages("arrow")
   library(arrow, warn.conflicts = FALSE)
   
   tbl <- tibble::tibble(
     x = 1:5,
     points = wk::xy(x = 1:5, y = 6:10)
   )
   
   # wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
   str(unclass(tbl$points))
   #> List of 2
   #>  $ x: num [1:5] 1 2 3 4 5
   #>  $ y: num [1:5] 6 7 8 9 10
   
   # in the release version this fails
   Table$create(tbl)
   #> Error: Invalid: All columns must have the same length
   
   # ...or generates bogus output
   as.data.frame(Table$create(x = 1:2, points = tbl$points))
   #> Warning: Invalid metadata$r
   #> # A tibble: 2 × 2
   #>       x         points
   #>   <int> <list<double>>
   #> 1     1            [5]
   #> 2     2            [5]
   ```
   
   After this PR you can define `type()` and `as_arrow_array()` and the 
conversion should "just work". This is particularly useful in conjunction with 
the new extension type support, which can handle most (non-list-based) vctr 
classes (e.g., this PR removes the internal conversions for POSIXlt and haven 
types because the vctrs extension array handles them out of the box).
   
   ``` r
   # remotes::install_github("apache/arrow/r#12817")
   library(arrow, warn.conflicts = FALSE)
   
   tbl <- tibble::tibble(
     x = 1:5,
     points = wk::xy(x = 1:5, y = 6:10)
   )
   
   # wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
   str(unclass(tbl$points))
   #> List of 2
   #>  $ x: num [1:5] 1 2 3 4 5
   #>  $ y: num [1:5] 6 7 8 9 10
   
   # this now fails:
   tf <- tempfile()
   write_feather(tbl, tf)
   #> Error:
   #> ! Can't infer Arrow data type from object inheriting from wk_xy / wk_rcrd
   
   # until...
   type.wk_xy <- function(x, ...) {
     vctrs_extension_type(vctrs::vec_ptype(x))
   }
   
   as_arrow_array.wk_xy <- function(x, ...) {
     vctrs_extension_array(x)
   }
   
   # now works!
   write_feather(tbl, tf)
   read_feather(tf)
   #> # A tibble: 5 × 2
   #>       x points 
   #>   <int> <wk_xy>
   #> 1     1 (1  6) 
   #> 2     2 (2  7) 
   #> 3     3 (3  8) 
   #> 4     4 (4  9) 
   #> 5     5 (5 10)
   
   # if for some reason the extension type is not loaded, we get the storage type
   # with no warning (maybe not ideal?)
   arrow::unregister_extension_type("arrow.r.vctrs")
   read_feather(tf)
   #> # A tibble: 5 × 2
   #>       x points$x    $y
   #>   <int>    <dbl> <dbl>
   #> 1     1        1     6
   #> 2     2        2     7
   #> 3     3        3     8
   #> 4     4        4     9
   #> 5     5        5    10
   ```
   
   <sup>Created on 2022-04-14 by the [reprex 
package](https://reprex.tidyverse.org) (v2.0.1)</sup>
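
   If the registration is needed again (e.g., in a fresh session), re-registering the vctrs extension type should restore the full round trip; a sketch, reusing the same `vctrs_extension_type()` helper as above and assuming `register_extension_type()` accepts that object:

   ``` r
   # sketch: register the vctrs extension type again so that reading the file
   # reconstructs the wk_xy column instead of the storage struct
   register_extension_type(vctrs_extension_type(vctrs::vec_ptype(wk::xy())))
   read_feather(tf)
   ```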

