Dewey Dunnington created ARROW-16670:
----------------------------------------
Summary: [R] Behaviour of R-specific key/value metadata in the
query engine
Key: ARROW-16670
URL: https://issues.apache.org/jira/browse/ARROW-16670
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Dewey Dunnington
In ARROW-16607 there are some changes to metadata handling in
{{arrow_dplyr_query}}. With extension type support, more column types (like
sf::sfc) can be supported, and as support for column types grows, so does the
chance that our current policy of restoring metadata by default will cause
errors that are difficult to work around. The latest one I have run across is
this one:
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)
open_dataset(tf) %>%
  select(NAME, FIPS) %>%
  collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?
{code}
This causes an error because the restored class has assumptions about the
contents of the data frame that we can't necessarily know about (or would have
to hard code for every data frame subclass).
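The failure mode can be reproduced without sf or arrow. The following is a minimal base R sketch, not the actual sf or arrow code: the {{geo_df}} class and the {{new_geo_df()}} and {{check_geo_df()}} helpers are hypothetical names invented for illustration. It shows how restoring saved attributes unconditionally onto a result that has lost a column violates an assumption of the subclass.

{code:R}
# Hypothetical data frame subclass that records the name of a "special" column,
# loosely analogous to how sf tracks its geometry column in an attribute.
new_geo_df <- function(df, geom_col) {
  stopifnot(geom_col %in% names(df))
  structure(df, class = c("geo_df", "data.frame"), geom_col = geom_col)
}

# The subclass assumes that the recorded column always exists.
check_geo_df <- function(x) {
  if (!attr(x, "geom_col") %in% names(x)) {
    stop("attr(x, \"geom_col\") does not point to an existing column")
  }
  invisible(x)
}

original <- new_geo_df(
  data.frame(name = c("a", "b"), geom = c("POINT (0 0)", "POINT (1 1)")),
  geom_col = "geom"
)

# Stand-in for the metadata captured at write time
saved <- attributes(original)[c("class", "geom_col")]

# Stand-in for query output after select(name): the special column is gone
result <- data.frame(name = c("a", "b"))

# Restore the saved attributes without checking the new contents
attr(result, "geom_col") <- saved$geom_col
class(result) <- saved$class

# The restored object now breaks the subclass's own assumption
err <- tryCatch(check_geo_df(result), error = function(e) e)
conditionMessage(err)
{code}

The code doing the restoring has no general way to know that {{geom_col}} must name an existing column, which is why each subclass would otherwise need its own hard-coded handling.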
I can see why {{arrow::write_parquet()}} and {{arrow::read_parquet()}} (and
feather, ipc_stream) might want to do this to faithfully roundtrip a data
frame, and because the write/read roundtrip (usually) involves the same columns
and the same rows, it's probably safe to restore metadata by default.
The query engine does a lot of transformations that can break assumptions like
the one I've shown above (where sf expects a certain column to exist and errors
otherwise in a way that the user can't work around). Rather than hard-code the
assumptions of every data.frame and vector subclass, I wonder if ignoring the R
metadata for query engine output would be a better strategy. Even if ignoring
it does not become the default, it would be nice to provide an escape hatch for
users or developers who find themselves in this position with no workaround.
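As a rough sketch of what such an escape hatch could look like from the user side, the query result can be materialized as an Arrow Table with {{compute()}} and the R-specific entry removed from the schema metadata before converting to a data frame. This assumes that the R attributes are stored under the {{r}} key of the schema metadata and that assigning into {{$metadata}} on a Table works as documented; behaviour may differ between arrow versions, so treat it as an illustration rather than a supported workaround.

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5, chr_col = letters[1:5])
attr(df$int_col, "some_attr") <- "some_value"

tf <- tempfile()
write_dataset(df, tf)

# Materialize the query as an Arrow Table instead of collect()ing directly,
# so that no R metadata is restored yet
tab <- open_dataset(tf) %>%
  select(int_col) %>%
  compute()

# Assumption: the serialized R attributes live under the "r" metadata key
tab$metadata$r <- NULL

# Convert to a data frame with nothing left to restore
out <- as.data.frame(tab)
attributes(out$int_col)
{code}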
With the addition of the vctrs extension type, there is a route to preserve
attributes through the query engine (although it's a bit verbose). We could
make it easier to do (e.g., by interpreting {{I()}} or {{rlang::box()}} in some
way).
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"
tf <- tempfile()
# attributes dropped when column is renamed
write_dataset(df, tf)
open_dataset(tf) %>%
  select(other_int_col = int_col) %>%
  collect() %>%
  pull()
#> [1] 1 2 3 4 5
# attributes preserved when column is renamed
table <- arrow_table(int_col = vctrs_extension_array(df$int_col))
write_dataset(table, tf)
open_dataset(tf) %>%
  select(other_int_col = int_col) %>%
  collect() %>%
  pull()
#> [1] 1 2 3 4 5
#> attr(,"some_attr")
#> [1] "some_value"
{code}