paleolimbot commented on PR #12817:
URL: https://github.com/apache/arrow/pull/12817#issuecomment-1099521412
Ok! This is ready for a review. The main motivation here is to allow other
packages to define conversions to Arrow objects. This is most useful for Table
objects, since we currently convert to Table before
`write_parquet/ipc_stream/feather/csv_arrow()`, and for Array objects, because
the R vector -> Array conversion is at the heart of the default conversion to
Table. A motivating project for these methods is 'geoarrow', which implements
an ExtensionType to store geometry columns (e.g., `sf::st_sfc()`) and
table-level metadata for table-level objects (e.g., `sf::st_sf()`).
Other methods I added here are important for lower-level packages like the
work-in-progress 'substrait' and 'narrow', which have analogues of
`DataType` (e.g., the `narrow_schema()` or `substrait.Type`), `Schema` (e.g.,
the `narrow_schema()` or `substrait.NamedStruct`), and `RecordBatchReader`
(e.g., the `narrow_array_stream()`). In particular, I think that
`as_record_batch_reader()` will be very useful: it is more flexible than
`Table` for things like database results that can be streamed into the writers.
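As a rough sketch of the kind of method an external package could define
(the class `my_db_result` and the helper `fetch_all()` are hypothetical
placeholders, not real APIs), a simple implementation could fetch the result
into a data frame and delegate to the conversions added here:
``` r
# Hypothetical sketch: 'my_db_result' and fetch_all() stand in for a
# package-specific result class and its fetch function.
as_record_batch_reader.my_db_result <- function(x, ...) {
  # Fetch eagerly and wrap the result; a real method could instead stream
  # record batches as they arrive from the database.
  as_record_batch_reader(as_arrow_table(fetch_all(x)))
}
```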
Adding the S3 methods was reasonably straightforward; however, actually
using them was quite complicated because R vector -> Array conversion is highly
optimized and mostly done in C++. The approach I took was to keep the existing
code, changing as little as possible, for R objects that we handle internally.
For other objects, the C++ code calls `type()` and `as_arrow_array()`. As is
apparent from the number of files changed by this PR, that "simple" approach
led to a whack-a-mole of test failures that I think I was able to solve with a
minimal footprint; however, another option is to close this PR and tackle the
changes in reverse order with smaller PRs.
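Conceptually, the column conversion now looks roughly like the sketch below
(this is an R-level illustration of the fallback, not the actual C++ code path;
`is_handled_internally()` and `convert_internally()` are made-up names, and the
exact arguments passed to the generics may differ):
``` r
# R-level sketch of the fallback described above; the real logic lives in C++.
convert_column <- function(x) {
  if (is_handled_internally(x)) {
    # existing optimized conversion for objects arrow handles internally
    convert_internally(x)
  } else {
    # new: infer the target type and dispatch to the S3 generics
    as_arrow_array(x, type = type(x))
  }
}
```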
In a nutshell, before this PR, users were stuck with the built-in behaviour for
objects with a custom class. This was usually OK, but occasionally sub-optimal,
non-functional, or corrupting. I've picked a probably rare example because
record-style vectors are new, but the current set of conversions assumes that
they're list-like (as opposed to data-frame-like):
``` r
# install.packages("arrow")
library(arrow, warn.conflicts = FALSE)
tbl <- tibble::tibble(
  x = 1:5,
  points = wk::xy(x = 1:5, y = 6:10)
)
# wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
str(unclass(tbl$points))
#> List of 2
#>  $ x: num [1:5] 1 2 3 4 5
#>  $ y: num [1:5] 6 7 8 9 10
# in the release version this fails
Table$create(tbl)
#> Error: Invalid: All columns must have the same length
# ...or generates bogus output
as.data.frame(Table$create(x = 1:2, points = tbl$points))
#> Warning: Invalid metadata$r
#> # A tibble: 2 × 2
#>       x         points
#>   <int> <list<double>>
#> 1     1            [5]
#> 2     2            [5]
```
After this PR you can define `type()` and `as_arrow_array()` and the
conversion should "just work". This is particularly useful in conjunction with
the new extension type support, which can handle most (non-list-based) vctrs
classes (e.g., this PR removes the internal conversions for POSIXlt and haven
types because the vctrs extension array handles them out of the box).
``` r
# remotes::install_github("apache/arrow/r#12817")
library(arrow, warn.conflicts = FALSE)
tbl <- tibble::tibble(
  x = 1:5,
  points = wk::xy(x = 1:5, y = 6:10)
)
# wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
str(unclass(tbl$points))
#> List of 2
#>  $ x: num [1:5] 1 2 3 4 5
#>  $ y: num [1:5] 6 7 8 9 10
# this now fails:
tf <- tempfile()
write_feather(tbl, tf)
#> Error:
#> ! Can't infer Arrow data type from object inheriting from wk_xy / wk_rcrd
# until...
type.wk_xy <- function(x, ...) {
  vctrs_extension_type(vctrs::vec_ptype(x))
}

as_arrow_array.wk_xy <- function(x, ...) {
  vctrs_extension_array(x)
}
# now works!
write_feather(tbl, tf)
read_feather(tf)
#> # A tibble: 5 × 2
#>       x points
#>   <int> <wk_xy>
#> 1     1 (1 6)
#> 2     2 (2 7)
#> 3     3 (3 8)
#> 4     4 (4 9)
#> 5     5 (5 10)
# if for some reason the extension type is not loaded, we get the storage type
# with no warning (maybe not ideal?)
arrow::unregister_extension_type("arrow.r.vctrs")
read_feather(tf)
#> # A tibble: 5 × 2
#>       x points$x    $y
#>   <int>    <dbl> <dbl>
#> 1     1        1     6
#> 2     2        2     7
#> 3     3        3     8
#> 4     4        4     9
#> 5     5        5    10
```
<sup>Created on 2022-04-14 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)</sup>