[jira] [Created] (ARROW-15627) [R] Support unify_schemas for union datasets

Will Jones (Jira) Wed, 09 Feb 2022 07:59:07 -0800

Will Jones created ARROW-15627:
----------------------------------

             Summary: [R] Support unify_schemas for union datasets
                 Key: ARROW-15627
                 URL: https://issues.apache.org/jira/browse/ARROW-15627
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 7.0.0
            Reporter: Will Jones
             Fix For: 8.0.0



This also came out of the discussion on [https://github.com/apache/arrow/issues/12371].

You can unify schemas across different Parquet files, but it seems you can't
union two (or more) datasets that have different schemas. This is odd, because
we do compute the unified schema on [this line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189],
only to later assert that all the schemas are the same.

{code:R}
library(arrow)
library(dplyr)

df1 <- arrow_table(x = array(c(1, 2, 3)),
                   y = array(c("a", "b", "c")))
df2 <- arrow_table(x = array(c(4, 5)),
                   z = array(c("d", "e")))

df1 %>% write_dataset("example1", format="parquet")
df2 %>% write_dataset("example2", format="parquet")

ds1 <- open_dataset("example1", format="parquet")
ds2 <- open_dataset("example2", format="parquet")

# These don't work
ds <- c(ds1, ds2) # c() does the same thing as open_dataset(list(...)) below
ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema
ds <- open_dataset(c("example1", "example2"), format="parquet",
                   unify_schemas = TRUE)

# This does
ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"),
                   format="parquet", unify_schemas = TRUE)
{code}
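
For reference, a minimal sketch of the schema one might expect a union of the two example datasets to expose, built with the exported unify_schemas() helper. The names tab1, tab2, and unified are illustrative only and are not part of the report; whether a union dataset should adopt exactly this schema is the open question here.

```r
library(arrow)

# Illustrative tables with the same column layout as df1 and df2 above.
tab1 <- arrow_table(x = c(1, 2, 3), y = c("a", "b", "c"))
tab2 <- arrow_table(x = c(4, 5), z = c("d", "e"))

# unify_schemas() merges fields by name: x is shared by both tables,
# while y and z each come from only one of them.
unified <- unify_schemas(tab1$schema, tab2$schema)

# Field names of the unified schema.
names(unified)
```

Under this sketch, rows coming from a dataset that lacks a column would presumably read as null for that column, as they do in the file-level unify_schemas = TRUE case that already works.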



