Will Jones created ARROW-15627:
----------------------------------
Summary: [R] Support unify_schemas for union datasets
Key: ARROW-15627
URL: https://issues.apache.org/jira/browse/ARROW-15627
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
Fix For: 8.0.0
This also came out of the discussion on [https://github.com/apache/arrow/issues/12371].
You can unify schemas between different Parquet files, but it seems you
can't union two (or more) datasets that have different schemas. This
is odd, because we do compute the unified schema on [this
line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189],
only to later assert that all the schemas are the same.
{code:R}
library(arrow)
library(dplyr)
df1 <- arrow_table(x = array(c(1, 2, 3)),
                   y = array(c("a", "b", "c")))
df2 <- arrow_table(x = array(c(4, 5)),
                   z = array(c("d", "e")))
df1 %>% write_dataset("example1", format="parquet")
df2 %>% write_dataset("example2", format="parquet")
ds1 <- open_dataset("example1", format="parquet")
ds2 <- open_dataset("example2", format="parquet")
# These don't work
ds <- c(ds1, ds2) # c() does the same thing as open_dataset(list(...))
ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema
ds <- open_dataset(c("example1", "example2"), format = "parquet",
                   unify_schemas = TRUE)
# This does work
ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"),
                   format = "parquet", unify_schemas = TRUE)
{code}
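Until union datasets support this, one possible workaround is sketched below. It assumes both datasets are file-based, so that their {{$files}} fields list the underlying Parquet files, and it reuses the file-path form of {{open_dataset()}} that works above. The temporary directory and the {{p1}}/{{p2}} names are only for illustration.
{code:R}
library(arrow)

# Hypothetical scratch location so the example is self-contained
tmp <- tempfile()
dir.create(tmp)
p1 <- file.path(tmp, "example1")
p2 <- file.path(tmp, "example2")

write_dataset(data.frame(x = c(1, 2, 3), y = c("a", "b", "c")), p1, format = "parquet")
write_dataset(data.frame(x = c(4, 5), z = c("d", "e")), p2, format = "parquet")

ds1 <- open_dataset(p1, format = "parquet")
ds2 <- open_dataset(p2, format = "parquet")

# unify_schemas() merges the fields of both schemas by name
unified <- unify_schemas(ds1$schema, ds2$schema)

# Workaround: reopen the underlying files in a single call,
# letting open_dataset() unify the per-file schemas itself
ds <- open_dataset(c(ds1$files, ds2$files), format = "parquet",
                   unify_schemas = TRUE)
{code}
This sidesteps the union code path entirely, so it does not help when the inputs are not plain file-system datasets.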