[GitHub] [arrow] jonkeane commented on a change in pull request #12185: ARROW-15020: [R] Add bindings for new dataset writing options

GitBox Fri, 28 Jan 2022 07:54:29 -0800


jonkeane commented on a change in pull request #12185:
URL: https://github.com/apache/arrow/pull/12185#discussion_r794606763




##########
File path: r/R/dataset-write.R
##########
@@ -111,6 +121,10 @@ write_dataset <- function(dataset,
                           hive_style = TRUE,
                           existing_data_behavior = c("overwrite", "error", 
"delete_matching"),
                           max_partitions = 1024L,
+                          max_open_files = 900L,
+                          max_rows_per_file = 0L,
+                          min_rows_per_group = 0L, 

Review comment:
       ```suggestion
                             min_rows_per_group = 0L,
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -50,6 +50,16 @@
 #'   partitions which data is not written to.
 #' @param max_partitions maximum number of partitions any batch may be
 #' written into. Default is 1024L.
+#' @param max_open_files maximum number of files that can be left opened
+#' during a write operation. Default is 900L.
+#' @param max_rows_per_file maximum limit how many rows are placed in 
+#' any single file
+#' @param min_rows_per_group write the row groups to the disk when sufficient 

Review comment:
       ```suggestion
   #' @param min_rows_per_group write the row groups to the disk when this 
number of
   ```

##########
File path: r/tests/testthat/test-dataset-write.R
##########
@@ -505,3 +505,145 @@ test_that("Max partitions fails with non-integer values 
and less than required p
     "max_partitions must be a positive, non-missing integer"
   )
 })
+
+get_num_of_files = function(dir, format) {
+    files <-  list.files(dir, pattern = paste('.', format, sep=""), 

Review comment:
       ```suggestion
       files <- list.files(dir, pattern = paste(".", format, sep = ""), 
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -50,6 +50,16 @@
 #'   partitions which data is not written to.
 #' @param max_partitions maximum number of partitions any batch may be
 #' written into. Default is 1024L.
+#' @param max_open_files maximum number of files that can be left opened
+#' during a write operation. Default is 900L.
+#' @param max_rows_per_file maximum limit how many rows are placed in 
+#' any single file

Review comment:
       ```suggestion
   #' @param max_rows_per_file maximum number of rows to be placed in
   #' any single file.
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -146,6 +160,8 @@ write_dataset <- function(dataset,
   dataset___Dataset__Write(
     options, path_and_fs$fs, path_and_fs$path,
     partitioning, basename_template, scanner,
-    existing_data_behavior, max_partitions
+    existing_data_behavior, max_partitions,
+    max_open_files, max_rows_per_file, 

Review comment:
       ```suggestion
       max_open_files, max_rows_per_file,
   ```

##########
File path: r/tests/testthat/test-dataset-write.R
##########
@@ -505,3 +505,145 @@ test_that("Max partitions fails with non-integer values 
and less than required p
     "max_partitions must be a positive, non-missing integer"
   )
 })
+
+get_num_of_files = function(dir, format) {

Review comment:
       ```suggestion
   get_num_of_files <- function(dir, format) {
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -111,6 +121,10 @@ write_dataset <- function(dataset,
                           hive_style = TRUE,
                           existing_data_behavior = c("overwrite", "error", 
"delete_matching"),
                           max_partitions = 1024L,
+                          max_open_files = 900L,

Review comment:
       Is `900L` here a default in C++? I'm curious where that specific number 
came from... (it's close to `1024L` or `1000L`, but curious why it's lower?)

##########
File path: r/tests/testthat/test-dataset-write.R
##########
@@ -505,3 +505,145 @@ test_that("Max partitions fails with non-integer values 
and less than required p
     "max_partitions must be a positive, non-missing integer"
   )
 })
+
+get_num_of_files = function(dir, format) {
+    files <-  list.files(dir, pattern = paste('.', format, sep=""), 
+    all.files = FALSE, recursive = TRUE, full.names = TRUE)
+    length(files)
+}
+
+test_that("Dataset write max open files", {
+  skip_if_not_available("parquet") 
+  # test default partitioning
+  dst_dir <- make_temp_dir()
+  file_format <- "parquet"
+  partitioning <- c("c2")
+  num_of_unique_c2_groups = 5
+
+  record_batch_1 <- record_batch(c1=c(1, 2, 3, 4, 0, 10),
+                                     c2=c('a', 'b', 'c', 'd', 'e', 'a'))

Review comment:
       Also a minor style note but `"` should generally be used over `'`

##########
File path: r/tests/testthat/test-dataset-write.R
##########
@@ -505,3 +505,145 @@ test_that("Max partitions fails with non-integer values 
and less than required p
     "max_partitions must be a positive, non-missing integer"
   )
 })
+
+get_num_of_files = function(dir, format) {
+    files <-  list.files(dir, pattern = paste('.', format, sep=""), 
+    all.files = FALSE, recursive = TRUE, full.names = TRUE)
+    length(files)
+}
+
+test_that("Dataset write max open files", {
+  skip_if_not_available("parquet") 
+  # test default partitioning
+  dst_dir <- make_temp_dir()
+  file_format <- "parquet"
+  partitioning <- c("c2")
+  num_of_unique_c2_groups = 5
+
+  record_batch_1 <- record_batch(c1=c(1, 2, 3, 4, 0, 10),

Review comment:
       This is minor, but typically in R we put spaces around `=`, so `c1=c(1, 
2, 3, 4, 0, 10)` should be `c1 = c(1, 2, 3, 4, 0, 10)`

##########
File path: r/R/dataset-write.R
##########
@@ -50,6 +50,16 @@
 #'   partitions which data is not written to.
 #' @param max_partitions maximum number of partitions any batch may be
 #' written into. Default is 1024L.
+#' @param max_open_files maximum number of files that can be left opened
+#' during a write operation. Default is 900L.
+#' @param max_rows_per_file maximum limit how many rows are placed in 
+#' any single file
+#' @param min_rows_per_group write the row groups to the disk when sufficient 
+#' rows have accumulated.
+#' @param max_rows_per_group maximum number of row groups allowed in a single 
+#' group and when the rows execeeds, it is splitted and excess is written to 
+#' another group. This value must be set such that it is greater than 
+#' min_rows_per_group. 

Review comment:
       ```suggestion
   #' @param max_rows_per_group maximum number of rows allowed in a single
   #' group and when this number of rows is exceeded, it is split and the next 
set
   #' of rows is written to the next group. This value must be set such that it 
is 
   #' greater than `min_rows_per_group`.
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -111,6 +121,10 @@ write_dataset <- function(dataset,
                           hive_style = TRUE,
                           existing_data_behavior = c("overwrite", "error", 
"delete_matching"),
                           max_partitions = 1024L,
+                          max_open_files = 900L,
+                          max_rows_per_file = 0L,

Review comment:
       It's a bit odd that the default `0L` here actually means no limit. In R 
I would expect either `NA`, `NULL`, or `Inf` to represent no limit.
   
   For example, when using `write_parquet()` `chunk_size` defaults to `NULL`:
   
   
https://github.com/apache/arrow/blob/07ec0a12d430dc9151678b6f00d5c6fc0598f034/r/R/parquet.R#L144




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jonkeane commented on a change in pull request #12185: ARROW-15020: [R] Add bindings for new dataset writing options

Reply via email to