[GitHub] [arrow] jonkeane commented on a change in pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

GitBox Fri, 18 Mar 2022 11:46:46 -0700


jonkeane commented on a change in pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#discussion_r830261151




##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
   expect_error(c(ds1, 42), "character")
 })
 
+test_that("UnionDataset can merge schemas", {
+  sub_df1 <- Table$create(
+    x = Array$create(c(1, 2, 3)),
+    y = Array$create(c("a", "b", "c"))
+  )
+  sub_df2 <- Table$create(
+    x = Array$create(c(4, 5)),
+    z = Array$create(c("d", "e"))
+  )
+
+  path1 <- make_temp_dir()
+  path2 <- make_temp_dir()
+  write_dataset(sub_df1, path1, format = "parquet")
+  write_dataset(sub_df2, path2, format = "parquet")
+
+  ds1 <- open_dataset(path1, format = "parquet")
+  ds2 <- open_dataset(path2, format = "parquet")
+
+  ds <- c(ds1, ds2)
+  actual <- ds %>%
+    collect() %>%
+    arrange(x)
+  expect_equal(
+    actual,
+    union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+  )

Review comment:
       I can see an argument for either side of this: might we want to assert the 
actual column names? I know that `union_all` preserves all the columns from 
both, but it might be nice to make it super obvious in the test that that's 
what's happening.
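
A minimal sketch of what such an assertion might look like, assuming testthat's
`expect_named()` and using a hand-built data frame as a stand-in for the
collected `actual` from the hunk above (the stand-in values and the choice of
`ignore.order = TRUE` are illustrative assumptions, not part of the PR):

```r
library(testthat)

# Hypothetical stand-in for `actual`: the rows of sub_df1 and sub_df2 after
# the union, with NA where a dataset lacks the column.
actual <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "b", "c", NA, NA),
  z = c(NA, NA, NA, "d", "e")
)

# Make the merged schema explicit: columns from both datasets are present.
# ignore.order = TRUE avoids assuming a particular unified column order.
expect_named(actual, c("x", "y", "z"), ignore.order = TRUE)
```

Dropping `ignore.order = TRUE` would additionally pin the column order, if the
test should check that as well.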

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
   expect_error(c(ds1, 42), "character")
 })
 
+test_that("UnionDataset can merge schemas", {
+  sub_df1 <- Table$create(
+    x = Array$create(c(1, 2, 3)),
+    y = Array$create(c("a", "b", "c"))
+  )
+  sub_df2 <- Table$create(
+    x = Array$create(c(4, 5)),
+    z = Array$create(c("d", "e"))
+  )
+
+  path1 <- make_temp_dir()
+  path2 <- make_temp_dir()
+  write_dataset(sub_df1, path1, format = "parquet")
+  write_dataset(sub_df2, path2, format = "parquet")
+
+  ds1 <- open_dataset(path1, format = "parquet")
+  ds2 <- open_dataset(path2, format = "parquet")
+
+  ds <- c(ds1, ds2)
+  actual <- ds %>%
+    collect() %>%
+    arrange(x)
+  expect_equal(
+    actual,
+    union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+  )
+
+  # without unifying schemas, takes the first schema

Review comment:
       ```suggestion
     # without unifying schemas, takes the first schema and discards any columns in the second which aren't in the first
   ```
   
   A slightly more descriptive comment (it took me a second to see what was 
going on here when reading it).
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

