jonkeane commented on a change in pull request #9972:
URL: https://github.com/apache/arrow/pull/9972#discussion_r612498334
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )
Review comment:
We should assert what each of these errors contains. We don't need to
match the full message, but let's make sure that each one fails with
something useful about partitions.
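
A rough sketch of what such an assertion could look like with testthat, whose `expect_error()` accepts a `regexp` that is matched against the error message. `write_stub()` is a hypothetical stand-in for `write_dataset()` and its message text is invented for illustration; the real wording would come from arrow.

```r
library(testthat)

# Hypothetical stand-in for write_dataset(); the message text is made up.
write_stub <- function(n_needed, max_partitions) {
  if (!is.numeric(max_partitions) || max_partitions < n_needed) {
    stop("Fragment would be written into ", n_needed,
         " partitions, which exceeds max_partitions", call. = FALSE)
  }
  invisible(TRUE)
}

test_that("errors mention partitions", {
  # Match only a distinctive fragment of the message, not the whole text
  expect_error(write_stub(3, "foobar"), "max_partitions")
  expect_error(write_stub(3, 1), "partitions")
})
```

Matching a short fragment keeps the test stable if the rest of the message is reworded later.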
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1778,3 +1778,60 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12315
+test_that("Max partitions fails with non-integer values and less than required partitions values", {
+  skip_if_not_available("parquet")
+  tmp <- tempfile()
+
+  # this example needs 3 partitions
+
+  # max_partitions = chr => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = "foobar")
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = -3)
+  )
+
+  # max_partitions < 3 => error
+  expect_error(
+    mtcars %>%
+      group_by(cyl) %>%
+      write_dataset(tmp, format = "parquet", max_partitions = 1)
+  )
Review comment:
We especially want to make sure that this error is clear and actionable.
##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,
Review comment:
Minor: in the .R code, we should follow the style here with each
argument on a new line.
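
For illustration, the layout being asked for would look roughly like the following. This is a sketch only: the function name is hypothetical, the body is a stub, and only the arguments visible in the hunk above are listed.

```r
# Layout sketch: hypothetical name, stub body, arguments taken from the hunk
write_dataset_sketch <- function(dataset,
                                 format = c("parquet", "feather", "arrow", "ipc"),
                                 partitioning = dplyr::group_vars(dataset),
                                 basename_template = paste0("part-{i}.", as.character(format)),
                                 hive_style = TRUE,
                                 max_partitions = 1024L,
                                 ...) {
  # Defaults are evaluated lazily, so defining this does not require dplyr
  invisible(NULL)
}
```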
##########
File path: r/R/dataset-write.R
##########
@@ -60,8 +62,13 @@ write_dataset <- function(dataset,
                           format = c("parquet", "feather", "arrow", "ipc"),
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
-                          hive_style = TRUE,
+                          hive_style = TRUE, max_partitions = 1024L,
                           ...) {
+  stopifnot(
+    max_partitions == round(max_partitions, 0),
+    max_partitions == abs(max_partitions),
+    !is.null(max_partitions)
+  )
Review comment:
Have you tried leaving this check out and seeing what errors the C++
code returns? If those errors are reasonable, we should use them instead of
writing our own here.
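
One way to run that experiment is to collect the message raised for each bad input and read them side by side. The sketch below is an assumption-laden illustration: `error_message()` and `write_unchecked()` are hypothetical helpers, and the message text is invented rather than taken from the C++ code.

```r
# Return the error message raised by `expr`, or NA if it succeeds
error_message <- function(expr) {
  tryCatch({
    force(expr)
    NA_character_
  }, error = function(e) conditionMessage(e))
}

# Hypothetical stand-in for write_dataset() with the R-side stopifnot() removed;
# the message is made up and only mimics an error passed up from lower-level code.
write_unchecked <- function(max_partitions) {
  stop("Invalid: made-up message about max_partitions = ", max_partitions)
}

bad_inputs <- list("foobar", -3, 1)
msgs <- vapply(
  bad_inputs,
  function(x) error_message(write_unchecked(x)),
  character(1)
)
print(msgs)
```

With arrow installed, the stand-in would be swapped for the real pipeline from the tests, `mtcars %>% group_by(cyl) %>% write_dataset(tmp, format = "parquet", max_partitions = x)`.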