[GitHub] [arrow] drin commented on a diff in pull request #13509: ARROW-16904: [C++] min/max not deterministic if Parquet files have multiple row groups

GitBox Fri, 08 Jul 2022 12:53:02 -0700


drin commented on code in PR #13509:
URL: https://github.com/apache/arrow/pull/13509#discussion_r917100523



##########
r/tests/testthat/test-dataset.R:
##########
@@ -618,6 +618,33 @@ test_that("UnionDataset handles InMemoryDatasets", {
   expect_equal(actual, expected)
 })
 
+test_that("scalar aggregates with many batches", {
+  test_data <- data.frame(val = 1:1e7)
+  expected_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      test_data                              %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
+  arrow::write_parquet(test_data, ds_tmpfile)
+  actual_result_distr <- (
+    sapply(1:100, function(iter_ndx) {
+      arrow::open_dataset(ds_tmpfile)        %>%
+        dplyr::summarise(min_val = min(val)) %>%
+        dplyr::collect()                     %>%
+        dplyr::pull(min_val)
+    }) %>%
+      table()
+  )
+
+  expect_equal(actual_result_distr, expected_result_distr)

Review Comment:
   The original test implementation was trying to make sure the aggregate always 
resolves to 1. I guess it could have been done in a loop instead of gathering 
everything into a table to be inspected at the end.
   
   I took Neal's suggestion, which simplifies the test body by using 
`replicate` and `all`.
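
   For reference, a rough sketch of what the `replicate` + `all` shape could 
look like. This is not the exact code in the PR: `always_equals` and 
`min_from_dataset` are hypothetical names used only for illustration, and the 
Arrow part assumes the `arrow` and `dplyr` packages are installed and that 
`arrow` was built with dataset support.

```r
# Hypothetical helper (not part of the PR): TRUE only if every one of
# `n` repeated evaluations of `compute()` equals `expected`.
always_equals <- function(compute, expected, n = 100) {
  all(replicate(n, compute()) == expected)
}

# The Arrow-specific part only runs when both packages are available.
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  # Same data as the test in the diff.
  test_data <- data.frame(val = 1:1e7)
  ds_tmpfile <- tempfile("test-aggregate", fileext = ".parquet")
  arrow::write_parquet(test_data, ds_tmpfile)

  # Re-open the dataset and compute the scalar aggregate on each call.
  min_from_dataset <- function() {
    ds <- arrow::open_dataset(ds_tmpfile)
    res <- dplyr::collect(dplyr::summarise(ds, min_val = min(val)))
    res$min_val
  }

  # The minimum of 1:1e7 is 1, so every repetition should return 1.
  stopifnot(always_equals(min_from_dataset, 1))
}
```

   Compared with building two `table()` objects and comparing them, this 
asserts the expected value directly, so a single non-deterministic result makes 
the check fail.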



