[
https://issues.apache.org/jira/browse/ARROW-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Will Jones updated ARROW-15627:
-------------------------------
Issue Type: Bug (was: Improvement)
> [R] Support unify_schemas for union datasets
> --------------------------------------------
>
> Key: ARROW-15627
> URL: https://issues.apache.org/jira/browse/ARROW-15627
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 7.0.0
> Reporter: Will Jones
> Assignee: Will Jones
> Priority: Minor
> Labels: dataset, pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This also came out of the discussion on [https://github.com/apache/arrow/issues/12371].
> You can unify schemas between different parquet files, but it seems like you
> can't union together two (or more) datasets that have different schemas. This
> is odd, because we do compute the unified schema on [this
> line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189],
> only to later assert all the schemas are the same.
> {code:R}
> library(arrow)
> library(dplyr)
> df1 <- arrow_table(x = array(c(1, 2, 3)),
>                    y = array(c("a", "b", "c")))
> df2 <- arrow_table(x = array(c(4, 5)),
>                    z = array(c("d", "e")))
> df1 %>% write_dataset("example1", format="parquet")
> df2 %>% write_dataset("example2", format="parquet")
> ds1 <- open_dataset("example1", format="parquet")
> ds2 <- open_dataset("example2", format="parquet")
> # These don't work
> ds <- c(ds1, ds2) # c() actually does the same thing
> ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema
> ds <- open_dataset(c("example1", "example2"), format="parquet",
>                    unify_schemas = TRUE)
> # This does
> ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"),
>                    format="parquet", unify_schemas = TRUE)
> {code}
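> For reference, a minimal sketch of the schema unification step on its own.
> The schemas s1 and s2 are hand-built, illustrative stand-ins for ds1$schema
> and ds2$schema above; this only shows what a unified schema looks like and
> is not a workaround for the union failure.
> {code:R}
> library(arrow)
> # Hypothetical schemas mirroring the columns of df1 and df2
> s1 <- schema(x = float64(), y = utf8())
> s2 <- schema(x = float64(), z = utf8())
> # Merge the two schemas into one that covers every field
> unified <- unify_schemas(s1, s2)
> # The unified schema should contain the fields x, y and z
> print(unified$names)
> {code}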
--
This message was sent by Atlassian Jira
(v8.20.1#820001)