jonkeane commented on a change in pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#discussion_r830261151
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
expect_error(c(ds1, 42), "character")
})
+test_that("UnionDataset can merge schemas", {
+ sub_df1 <- Table$create(
+ x = Array$create(c(1, 2, 3)),
+ y = Array$create(c("a", "b", "c"))
+ )
+ sub_df2 <- Table$create(
+ x = Array$create(c(4, 5)),
+ z = Array$create(c("d", "e"))
+ )
+
+ path1 <- make_temp_dir()
+ path2 <- make_temp_dir()
+ write_dataset(sub_df1, path1, format = "parquet")
+ write_dataset(sub_df2, path2, format = "parquet")
+
+ ds1 <- open_dataset(path1, format = "parquet")
+ ds2 <- open_dataset(path2, format = "parquet")
+
+ ds <- c(ds1, ds2)
+ actual <- ds %>%
+ collect() %>%
+ arrange(x)
+ expect_equal(
+ actual,
+ union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+ )
Review comment:
I can see an argument for either side of this, but might we want to assert the
actual column names? I know that `union_all` preserves all the columns from
both inputs, but it might be nice to make it super obvious in the test that
that's what's happening.
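
For illustration only, here is a rough sketch of what an explicit column-name assertion might look like. It uses plain tibbles and `bind_rows()` as a stand-in for the collected `UnionDataset` (that stand-in is an assumption, not what the PR's test does), so the names and values below are hypothetical apart from the `x`/`y`/`z` columns taken from the test above:

```r
library(dplyr)
library(testthat)

# Stand-ins for sub_df1 / sub_df2 from the test above (assumption: plain
# tibbles instead of Arrow Tables written to parquet and read back).
df1 <- tibble(x = c(1, 2, 3), y = c("a", "b", "c"))
df2 <- tibble(x = c(4, 5), z = c("d", "e"))

# Stand-in for `ds %>% collect() %>% arrange(x)`; bind_rows() keeps every
# column from both inputs and fills the gaps with NA.
actual <- bind_rows(df1, df2) %>%
  arrange(x)

# Make the merged schema obvious: all three columns are present.
expect_named(actual, c("x", "y", "z"))

# Rows that came from the second input have no `y`, and vice versa.
expect_true(all(is.na(actual$y[actual$x > 3])))
expect_true(all(is.na(actual$z[actual$x <= 3])))
```

In the real test the same `expect_named()` call could presumably sit right after the existing `expect_equal()`, possibly with `ignore.order = TRUE` if the column order of the unified schema is not something we want to pin down.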
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
expect_error(c(ds1, 42), "character")
})
+test_that("UnionDataset can merge schemas", {
+ sub_df1 <- Table$create(
+ x = Array$create(c(1, 2, 3)),
+ y = Array$create(c("a", "b", "c"))
+ )
+ sub_df2 <- Table$create(
+ x = Array$create(c(4, 5)),
+ z = Array$create(c("d", "e"))
+ )
+
+ path1 <- make_temp_dir()
+ path2 <- make_temp_dir()
+ write_dataset(sub_df1, path1, format = "parquet")
+ write_dataset(sub_df2, path2, format = "parquet")
+
+ ds1 <- open_dataset(path1, format = "parquet")
+ ds2 <- open_dataset(path2, format = "parquet")
+
+ ds <- c(ds1, ds2)
+ actual <- ds %>%
+ collect() %>%
+ arrange(x)
+ expect_equal(
+ actual,
+ union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+ )
+
+ # without unifying schemas, takes the first schema
Review comment:
```suggestion
# without unifying schemas, takes the first schema and discards any
# columns in the second which aren't in the first
```
A bit more descriptive comment (it took a second to see what was going on
here when reading it)