Dewey Dunnington created ARROW-16670:
----------------------------------------

             Summary: [R] Behaviour of R-specific key/value metadata in the query engine
                 Key: ARROW-16670
                 URL: https://issues.apache.org/jira/browse/ARROW-16670
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Dewey Dunnington


ARROW-16607 makes some changes to metadata handling in {{arrow_dplyr_query}}. With extension type support, more column types (like {{sf::sfc}}) can be supported, and with growing support for column types comes a greater chance that our current policy of restoring metadata by default will cause difficult-to-work-around errors. The latest error I have run across is this:

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# required for write_dataset(nc) to work
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
tf <- tempfile()
write_dataset(nc, tf)

open_dataset(tf) %>% 
  select(NAME, FIPS) %>% 
  collect()
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?
{code}

This errors because the restored class makes assumptions about the contents of the data frame that we can't necessarily know about (or would have to hard-code for every data frame subclass).
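
A minimal sketch of the failure mode (no Arrow involved; the column values are placeholders): restoring sf's class and {{sf_column}} attribute onto a data frame that no longer contains its geometry column reproduces the same error.

{code:R}
library(sf)

# Reattach the attributes that would be restored from stored metadata,
# but onto a data frame whose geometry column is gone.
df <- data.frame(NAME = "Ashe", FIPS = "37009")
attr(df, "sf_column") <- "geometry"
class(df) <- c("sf", "data.frame")

st_geometry(df)
#> Error in st_geometry.sf(x): attr(obj, "sf_column") does not point to a geometry column.
#> Did you rename it, without setting st_geometry(obj) <- "newname"?
{code}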

I can see why {{arrow::write_parquet()}} and {{arrow::read_parquet()}} (and their feather/IPC stream equivalents) might want to do this to faithfully roundtrip a data frame, and because the write/read roundtrip (usually) involves the same columns and the same rows, it's probably safe to restore metadata by default.
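
For contrast, a sketch of the safe case: a same-columns, same-rows roundtrip through Parquet, where the stored R metadata points at exactly the structure that comes back out.

{code:R}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"

# Identical columns and rows on both sides of the roundtrip, so
# restoring the attribute is safe.
tf <- tempfile()
write_parquet(df, tf)
attributes(read_parquet(tf)$int_col)
#> $some_attr
#> [1] "some_value"
{code}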

The query engine does a lot of transformations that can break assumptions like the one shown above (where sf expects a certain column to exist and otherwise errors in a way the user can't work around). Rather than hard-code the assumptions of every data.frame and vector subclass, I wonder if ignoring the R metadata for query engine output would be a better strategy. If that isn't the default, it would be nice to provide an escape hatch for users or developers who find themselves in this position with no workaround.
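
As a sketch, such an escape hatch might look like the following (the option name is purely hypothetical, not an existing arrow option):

{code:R}
# Hypothetical option, for illustration only: opt out of R metadata
# restoration for query engine output.
options(arrow.restore_r_metadata = FALSE)

open_dataset(tf) %>%     # tf: the sf dataset written above
  select(NAME, FIPS) %>%
  collect()              # would return a plain tibble; sf class not restored
{code}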

With the addition of the vctrs extension type, there is a route to preserve attributes through the query engine (although it's a bit verbose). We could make it easier to do (e.g., by interpreting {{I()}} or {{rlang::box()}} in some way):

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(int_col = 1:5)
attr(df$int_col, "some_attr") <- "some_value"

tf <- tempfile()

# attributes are dropped when the column is renamed through the query engine
write_dataset(df, tf)

open_dataset(tf) %>% 
  select(other_int_col = int_col) %>% 
  collect() %>% 
  pull()
#> [1] 1 2 3 4 5

# attributes are preserved across the same rename, via the vctrs extension type
table <- arrow_table(int_col = vctrs_extension_array(df$int_col))
write_dataset(table, tf)

open_dataset(tf) %>% 
  select(other_int_col = int_col) %>% 
  collect() %>% 
  pull()
#> [1] 1 2 3 4 5
#> attr(,"some_attr")
#> [1] "some_value"
{code}
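
A hypothetical spelling of the easier route (the {{I()}} interpretation below is an assumption, not current arrow behaviour):

{code:R}
# Hypothetical sugar only (not implemented): arrow_table() could interpret
# I() as "wrap this column in the vctrs extension type so its attributes
# survive the query engine", replacing the verbose call above.
table <- arrow_table(int_col = I(df$int_col))
{code}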



