thisisnic commented on code in PR #36436:
URL: https://github.com/apache/arrow/pull/36436#discussion_r1283726224


##########
r/R/csv.R:
##########
@@ -510,32 +510,55 @@ CsvReadOptions$create <- function(use_threads = option_use_threads(),
   options
 }
 
-readr_to_csv_write_options <- function(include_header = TRUE,
+readr_to_csv_write_options <- function(col_names = TRUE,
                                        batch_size = 1024L,
-                                       na = "") {
+                                       delim = ",",
+                                       na = "",
+                                       eol = "\n",
+                                       quote = "Needed") {

Review Comment:
   The valid values for `quote` in `readr::write_delim()` are `("needed", "all", "none")`, so do we need to allow those values here and then convert them to the Arrow-valid version in the body of this function?
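
   A minimal sketch of that conversion, assuming we accept the readr-style values (a hypothetical mapping, not what the PR currently does):

   ```r
   # Accept readr-style values and translate them to Arrow's quoting styles
   quote <- match.arg(quote, c("needed", "all", "none"))
   quote <- switch(quote,
     needed = "Needed",
     all = "AllValid",
     none = "None"
   )
   ```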



##########
r/R/dataset-write.R:
##########
@@ -209,6 +228,202 @@ write_dataset <- function(dataset,
   )
 }
 
+#' Write a dataset into partitioned flat files.
+#'
+#' The `write_*_dataset()` functions are a family of wrappers around [write_dataset]
+#' to allow for easy switching between functions for writing datasets.
+#'
+#' @param dataset [Dataset], [RecordBatch], [Table], `arrow_dplyr_query`, or
+#' `data.frame`. If an `arrow_dplyr_query`, the query will be evaluated and
+#' the result will be written. This means that you can `select()`, `filter()`,
+#' `mutate()`, etc. to transform the data before it is written if you need to.
+#' @param path String path, URI, or `SubTreeFileSystem` referencing a directory
+#' to write to (directory will be created if it does not exist)
+#' @param partitioning `Partitioning` or a character vector of columns to
+#' use as partition keys (to be written as path segments). Default is to
+#' use the current `group_by()` columns.
+#' @param basename_template String template for the names of files to be written.
+#' Must contain `"{i}"`, which will be replaced with an autoincremented
+#' integer to generate basenames of datafiles. For example, `"part-{i}.csv"`
+#' will yield `"part-0.csv", ...`.
+#' If not specified, it defaults to `"part-{i}.txt"` for `write_delim_dataset()`,
+#' `"part-{i}.csv"` for `write_csv_dataset()`, and `"part-{i}.tsv"` for
+#' `write_tsv_dataset()`.
+#' @param hive_style Write partition segments as Hive-style
+#' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`.
+#' @param existing_data_behavior The behavior to use when there is already data
+#' in the destination directory. Must be one of "overwrite", "error", or
+#' "delete_matching".
+#' - `overwrite` (the default): any new files created will overwrite
+#'   existing files
+#' - `error`: the operation will fail if the destination directory is not
+#'   empty
+#' - `delete_matching`: the writer will delete any existing partitions
+#'   if data is going to be written to those partitions, and will leave
+#'   alone partitions to which no data is written.
+#' @param max_partitions Maximum number of partitions any batch may be
+#' written into. Default is 1024L.
+#' @param max_open_files Maximum number of files that can be left open
+#' during a write operation. If greater than 0 then this will limit the
+#' maximum number of files that can be left open. If an attempt is made to open
+#' too many files then the least recently used file will be closed.
+#' If this setting is set too low you may end up fragmenting your data
+#' into many small files. The default is 900, which also allows some number of
+#' files to be open by the scanner before hitting the default Linux limit of 1024.
+#' @param max_rows_per_file Maximum number of rows per file.
+#' If greater than 0 then this will limit how many rows are placed in any single file.
+#' Default is 0L.
+#' @param min_rows_per_group Write the row groups to the disk when this number of
+#' rows has accumulated. Default is 0L.
+#' @param max_rows_per_group Maximum number of rows allowed in a single group.
+#' When this number of rows is exceeded, the group is split and the next set
+#' of rows is written to the next group. This value must be set such that it is
+#' greater than `min_rows_per_group`. Default is 1024 * 1024.
+#' @param col_names Whether to write an initial header line with column names.
+#' @param batch_size Maximum number of rows processed at a time. Default is 1024L.
+#' @param delim Delimiter used to separate values. Defaults to `","` for
+#' `write_delim_dataset()` and `write_csv_dataset()`, and `"\t"` for
+#' `write_tsv_dataset()`. Cannot be changed for `write_tsv_dataset()`.
+#' @param na a single character string to be used for missing values. Quotes are
+#' not allowed in this string. The default is an empty string `""`.
+#' @param eol The end of line character to use for ending rows. The default is `"\n"`.
+#' @param quote How to handle fields which contain characters that need to be
+#' quoted.
+#' - `Needed` (the default) - Enclose values in quotes only when needed, i.e.
+#'   when their CSV rendering could itself contain quotes (e.g. strings or
+#'   binary values)
+#' - `AllValid` - Enclose all valid values in quotes. Nulls are not quoted.
+#'   May cause readers to interpret all values as strings if the schema is
+#'   inferred.
+#' - `None` - Do not enclose any values in quotes. Values must not contain
+#'   quotes ("), cell delimiters (,) or line endings (\\r, \\n), following
+#'   RFC 4180. If values contain these characters, an error is raised when
+#'   attempting to write.
+#' @return The input `dataset`, invisibly.
+#'
+#' @seealso [write_dataset()]
+#' @export
+write_delim_dataset <- function(dataset,
+                                path,
+                                partitioning = dplyr::group_vars(dataset),
+                                basename_template = "part-{i}.txt",
+                                hive_style = TRUE,
+                                existing_data_behavior = c("overwrite", 
"error", "delete_matching"),
+                                max_partitions = 1024L,
+                                max_open_files = 900L,
+                                max_rows_per_file = 0L,
+                                min_rows_per_group = 0L,
+                                max_rows_per_group = bitwShiftL(1, 20),
+                                col_names = TRUE,
+                                batch_size = 1024L,
+                                delim = ",",
+                                na = "",
+                                eol = "\n",
+                                quote = "Needed") {
+  if (!missing(max_rows_per_file) && missing(max_rows_per_group) && max_rows_per_group > max_rows_per_file) {
+    max_rows_per_group <- max_rows_per_file
+  }
+  write_dataset(
+    dataset = dataset,
+    path = path,
+    format = "txt",
+    partitioning = partitioning,
+    basename_template = basename_template,
+    hive_style = hive_style,
+    existing_data_behavior = existing_data_behavior,
+    max_partitions = max_partitions,
+    max_open_files = max_open_files,
+    max_rows_per_file = max_rows_per_file,
+    min_rows_per_group = min_rows_per_group,
+    max_rows_per_group = max_rows_per_group,
+    include_header = col_names,
+    batch_size = batch_size,
+    delimiter = delim,
+    null_string = na,
+    eol = eol,
+    quoting_style = quote
+  )
+}
+
+#' @rdname write_delim_dataset
+#' @export
+write_csv_dataset <- function(dataset,
+                              path,
+                              partitioning = dplyr::group_vars(dataset),
+                              basename_template = "part-{i}.csv",
+                              hive_style = TRUE,
+                              existing_data_behavior = c("overwrite", "error", "delete_matching"),
+                              max_partitions = 1024L,
+                              max_open_files = 900L,
+                              max_rows_per_file = 0L,
+                              min_rows_per_group = 0L,
+                              max_rows_per_group = bitwShiftL(1, 20),
+                              col_names = TRUE,
+                              batch_size = 1024L,
+                              delim = ",",
+                              na = "",
+                              eol = "\n",
+                              quote = "Needed") {
+  if (!missing(max_rows_per_file) && missing(max_rows_per_group) && max_rows_per_group > max_rows_per_file) {
+    max_rows_per_group <- max_rows_per_file
+  }
+  write_dataset(
+    dataset = dataset,
+    path = path,
+    format = "csv",
+    partitioning = partitioning,
+    basename_template = basename_template,
+    hive_style = hive_style,
+    existing_data_behavior = existing_data_behavior,
+    max_partitions = max_partitions,
+    max_open_files = max_open_files,
+    max_rows_per_file = max_rows_per_file,
+    min_rows_per_group = min_rows_per_group,
+    max_rows_per_group = max_rows_per_group,
+    include_header = col_names,
+    batch_size = batch_size,
+    delimiter = delim,
+    null_string = na,
+    eol = eol,
+    quoting_style = quote
+  )
+}
+
+#' @rdname write_delim_dataset
+#' @export
+write_tsv_dataset <- function(dataset,
+                              path,
+                              partitioning = dplyr::group_vars(dataset),
+                              basename_template = "part-{i}.tsv",
+                              hive_style = TRUE,
+                              existing_data_behavior = c("overwrite", "error", "delete_matching"),
+                              max_partitions = 1024L,
+                              max_open_files = 900L,
+                              max_rows_per_file = 0L,
+                              min_rows_per_group = 0L,
+                              max_rows_per_group = bitwShiftL(1, 20),
+                              col_names = TRUE,
+                              batch_size = 1024L,
+                              na = "",
+                              eol = "\n",
+                              quote = "Needed") {
+  if (!missing(max_rows_per_file) && missing(max_rows_per_group) && max_rows_per_group > max_rows_per_file) {
+    max_rows_per_group <- max_rows_per_file
+  }

Review Comment:
   Just to check, why do we need to replicate this logic here?
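
   If the check does have to live in each wrapper, one option is a small shared helper so the logic isn't copied three times. A rough sketch (`cap_max_rows_per_group()` is a hypothetical name, not in the PR):

   ```r
   # Cap max_rows_per_group at max_rows_per_file when only the latter was
   # supplied by the caller; missing() flags are computed inside each wrapper
   cap_max_rows_per_group <- function(max_rows_per_file, max_rows_per_group,
                                      file_given, group_given) {
     if (file_given && !group_given && max_rows_per_group > max_rows_per_file) {
       max_rows_per_file
     } else {
       max_rows_per_group
     }
   }

   # In each wrapper:
   max_rows_per_group <- cap_max_rows_per_group(
     max_rows_per_file, max_rows_per_group,
     file_given = !missing(max_rows_per_file),
     group_given = !missing(max_rows_per_group)
   )
   ```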



##########
r/R/dataset-write.R:
##########
@@ -209,6 +228,202 @@ write_dataset <- function(dataset,
   )
 }
 
+#' Write a dataset into partitioned flat files.
+#'
+#' The `write_*_dataset()` functions are a family of wrappers around [write_dataset]
+#' to allow for easy switching between functions for writing datasets.
+#'
+#' @param dataset [Dataset], [RecordBatch], [Table], `arrow_dplyr_query`, or
+#' `data.frame`. If an `arrow_dplyr_query`, the query will be evaluated and
+#' the result will be written. This means that you can `select()`, `filter()`,
+#' `mutate()`, etc. to transform the data before it is written if you need to.
+#' @param path String path, URI, or `SubTreeFileSystem` referencing a directory
+#' to write to (directory will be created if it does not exist)
+#' @param partitioning `Partitioning` or a character vector of columns to
+#' use as partition keys (to be written as path segments). Default is to
+#' use the current `group_by()` columns.
+#' @param basename_template String template for the names of files to be written.
+#' Must contain `"{i}"`, which will be replaced with an autoincremented
+#' integer to generate basenames of datafiles. For example, `"part-{i}.csv"`
+#' will yield `"part-0.csv", ...`.
+#' If not specified, it defaults to `"part-{i}.txt"` for `write_delim_dataset()`,
+#' `"part-{i}.csv"` for `write_csv_dataset()`, and `"part-{i}.tsv"` for
+#' `write_tsv_dataset()`.
+#' @param hive_style Write partition segments as Hive-style
+#' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`.
+#' @param existing_data_behavior The behavior to use when there is already data
+#' in the destination directory. Must be one of "overwrite", "error", or
+#' "delete_matching".
+#' - `overwrite` (the default): any new files created will overwrite
+#'   existing files
+#' - `error`: the operation will fail if the destination directory is not
+#'   empty
+#' - `delete_matching`: the writer will delete any existing partitions
+#'   if data is going to be written to those partitions, and will leave
+#'   alone partitions to which no data is written.
+#' @param max_partitions Maximum number of partitions any batch may be
+#' written into. Default is 1024L.
+#' @param max_open_files Maximum number of files that can be left open
+#' during a write operation. If greater than 0 then this will limit the
+#' maximum number of files that can be left open. If an attempt is made to open
+#' too many files then the least recently used file will be closed.
+#' If this setting is set too low you may end up fragmenting your data
+#' into many small files. The default is 900, which also allows some number of
+#' files to be open by the scanner before hitting the default Linux limit of 1024.
+#' @param max_rows_per_file Maximum number of rows per file.
+#' If greater than 0 then this will limit how many rows are placed in any single file.
+#' Default is 0L.
+#' @param min_rows_per_group Write the row groups to the disk when this number of
+#' rows has accumulated. Default is 0L.
+#' @param max_rows_per_group Maximum number of rows allowed in a single group.
+#' When this number of rows is exceeded, the group is split and the next set
+#' of rows is written to the next group. This value must be set such that it is
+#' greater than `min_rows_per_group`. Default is 1024 * 1024.
+#' @param col_names Whether to write an initial header line with column names.
+#' @param batch_size Maximum number of rows processed at a time. Default is 1024L.
+#' @param delim Delimiter used to separate values. Defaults to `","` for
+#' `write_delim_dataset()` and `write_csv_dataset()`, and `"\t"` for
+#' `write_tsv_dataset()`. Cannot be changed for `write_tsv_dataset()`.
+#' @param na a single character string to be used for missing values. Quotes are
+#' not allowed in this string. The default is an empty string `""`.
+#' @param eol The end of line character to use for ending rows. The default is `"\n"`.
+#' @param quote How to handle fields which contain characters that need to be
+#' quoted.
+#' - `Needed` (the default) - Enclose values in quotes only when needed, i.e.
+#'   when their CSV rendering could itself contain quotes (e.g. strings or
+#'   binary values)
+#' - `AllValid` - Enclose all valid values in quotes. Nulls are not quoted.
+#'   May cause readers to interpret all values as strings if the schema is
+#'   inferred.
+#' - `None` - Do not enclose any values in quotes. Values must not contain
+#'   quotes ("), cell delimiters (,) or line endings (\\r, \\n), following
+#'   RFC 4180. If values contain these characters, an error is raised when
+#'   attempting to write.

Review Comment:
   Can you use `#' @inheritParams write_dataset` here to save copying some of this?
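
   For example (a sketch; roxygen2 matches inherited parameters by name, so only the new readr-style ones would need documenting here):

   ```r
   #' @inheritParams write_dataset
   #' @param col_names Whether to write an initial header line with column names.
   #' @param delim Delimiter used to separate values.
   ```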



##########
r/R/csv.R:
##########
@@ -510,32 +510,55 @@ CsvReadOptions$create <- function(use_threads = option_use_threads(),
   options
 }
 
-readr_to_csv_write_options <- function(include_header = TRUE,
+readr_to_csv_write_options <- function(col_names = TRUE,
                                        batch_size = 1024L,
-                                       na = "") {
+                                       delim = ",",
+                                       na = "",
+                                       eol = "\n",
+                                       quote = "Needed") {
   CsvWriteOptions$create(
-    include_header = include_header,
+    include_header = col_names,
     batch_size = batch_size,
-    null_string = na
+    delimiter = delim,
+    null_string = na,
+    eol = eol,
+    quoting_style = quote
   )
 }
 
 #' @rdname CsvReadOptions
 #' @export
 CsvWriteOptions <- R6Class("CsvWriteOptions", inherit = ArrowObject)
-CsvWriteOptions$create <- function(include_header = TRUE, batch_size = 1024L, null_string = "") {
-  assert_that(is_integerish(batch_size, n = 1, finite = TRUE), batch_size > 0)
+CsvWriteOptions$create <- function(include_header = TRUE,
+                                   batch_size = 1024L,
+                                   null_string = "",
+                                   delimiter = ",",
+                                   eol = "\n",
+                                   quoting_style = c("Needed", "AllValid", 
"None")) {
+  if (missing(quoting_style)) {
+    quoting_style <- "Needed"
+  }
+  quoting_style_opts <- c("Needed", "AllValid", "None")

Review Comment:
   I think here, you can use `match.arg()` instead
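
   Something like (sketch):

   ```r
   # match.arg() both validates the value and picks the first element
   # ("Needed") when quoting_style is left at its default vector
   quoting_style <- match.arg(quoting_style)
   ```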



##########
r/R/dataset-write.R:
##########
@@ -209,6 +228,202 @@ write_dataset <- function(dataset,
   )
 }
 
+#' Write a dataset into partitioned flat files.
+#'
+#' The `write_*_dataset()` functions are a family of wrappers around [write_dataset]
+#' to allow for easy switching between functions for writing datasets.
+#'
+#' @param dataset [Dataset], [RecordBatch], [Table], `arrow_dplyr_query`, or
+#' `data.frame`. If an `arrow_dplyr_query`, the query will be evaluated and
+#' the result will be written. This means that you can `select()`, `filter()`,
+#' `mutate()`, etc. to transform the data before it is written if you need to.
+#' @param path String path, URI, or `SubTreeFileSystem` referencing a directory
+#' to write to (directory will be created if it does not exist)
+#' @param partitioning `Partitioning` or a character vector of columns to
+#' use as partition keys (to be written as path segments). Default is to
+#' use the current `group_by()` columns.
+#' @param basename_template String template for the names of files to be written.
+#' Must contain `"{i}"`, which will be replaced with an autoincremented
+#' integer to generate basenames of datafiles. For example, `"part-{i}.csv"`
+#' will yield `"part-0.csv", ...`.
+#' If not specified, it defaults to `"part-{i}.txt"` for `write_delim_dataset()`,
+#' `"part-{i}.csv"` for `write_csv_dataset()`, and `"part-{i}.tsv"` for
+#' `write_tsv_dataset()`.
+#' @param hive_style Write partition segments as Hive-style
+#' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is `TRUE`.
+#' @param existing_data_behavior The behavior to use when there is already data
+#' in the destination directory. Must be one of "overwrite", "error", or
+#' "delete_matching".
+#' - `overwrite` (the default): any new files created will overwrite
+#'   existing files
+#' - `error`: the operation will fail if the destination directory is not
+#'   empty
+#' - `delete_matching`: the writer will delete any existing partitions
+#'   if data is going to be written to those partitions, and will leave
+#'   alone partitions to which no data is written.
+#' @param max_partitions Maximum number of partitions any batch may be
+#' written into. Default is 1024L.
+#' @param max_open_files Maximum number of files that can be left open
+#' during a write operation. If greater than 0 then this will limit the
+#' maximum number of files that can be left open. If an attempt is made to open
+#' too many files then the least recently used file will be closed.
+#' If this setting is set too low you may end up fragmenting your data
+#' into many small files. The default is 900, which also allows some number of
+#' files to be open by the scanner before hitting the default Linux limit of 1024.
+#' @param max_rows_per_file Maximum number of rows per file.
+#' If greater than 0 then this will limit how many rows are placed in any single file.
+#' Default is 0L.
+#' @param min_rows_per_group Write the row groups to the disk when this number of
+#' rows has accumulated. Default is 0L.
+#' @param max_rows_per_group Maximum number of rows allowed in a single group.
+#' When this number of rows is exceeded, the group is split and the next set
+#' of rows is written to the next group. This value must be set such that it is
+#' greater than `min_rows_per_group`. Default is 1024 * 1024.
+#' @param col_names Whether to write an initial header line with column names.
+#' @param batch_size Maximum number of rows processed at a time. Default is 1024L.
+#' @param delim Delimiter used to separate values. Defaults to `","` for
+#' `write_delim_dataset()` and `write_csv_dataset()`, and `"\t"` for
+#' `write_tsv_dataset()`. Cannot be changed for `write_tsv_dataset()`.
+#' @param na a single character string to be used for missing values. Quotes are
+#' not allowed in this string. The default is an empty string `""`.
+#' @param eol The end of line character to use for ending rows. The default is `"\n"`.
+#' @param quote How to handle fields which contain characters that need to be
+#' quoted.
+#' - `Needed` (the default) - Enclose values in quotes only when needed, i.e.
+#'   when their CSV rendering could itself contain quotes (e.g. strings or
+#'   binary values)
+#' - `AllValid` - Enclose all valid values in quotes. Nulls are not quoted.
+#'   May cause readers to interpret all values as strings if the schema is
+#'   inferred.
+#' - `None` - Do not enclose any values in quotes. Values must not contain
+#'   quotes ("), cell delimiters (,) or line endings (\\r, \\n), following
+#'   RFC 4180. If values contain these characters, an error is raised when
+#'   attempting to write.
+#' @return The input `dataset`, invisibly.
+#'
+#' @seealso [write_dataset()]
+#' @export
+write_delim_dataset <- function(dataset,
+                                path,
+                                partitioning = dplyr::group_vars(dataset),
+                                basename_template = "part-{i}.txt",
+                                hive_style = TRUE,
+                                existing_data_behavior = c("overwrite", 
"error", "delete_matching"),
+                                max_partitions = 1024L,
+                                max_open_files = 900L,
+                                max_rows_per_file = 0L,
+                                min_rows_per_group = 0L,
+                                max_rows_per_group = bitwShiftL(1, 20),
+                                col_names = TRUE,
+                                batch_size = 1024L,
+                                delim = ",",
+                                na = "",
+                                eol = "\n",
+                                quote = "Needed") {

Review Comment:
   This makes sense, as we are using the existing `write_dataset()` parameters followed by the `readr::write_delim()` ones, the same as we did for `open_delim_dataset()`.

   We might need to update the value of `quote` here too, to match the `readr::write_delim()` ones. Possibly the same with `na`.

   I was wondering if we should be enabling `col_types` by passing in a schema, but this isn't a valid parameter in `write_dataset()`, so we shouldn't here.
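
   A sketch of what readr-flavoured values might look like (the body would then translate them to Arrow's "Needed"/"AllValid"/"None" before calling `write_dataset()`):

   ```r
   write_delim_dataset <- function(dataset,
                                   path,
                                   ...,
                                   na = "",
                                   quote = c("needed", "all", "none")) {
     quote <- match.arg(quote)
     # translate to Arrow's quoting styles before passing through
     quoting_style <- switch(quote, needed = "Needed", all = "AllValid", none = "None")
     # ...
   }
   ```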
   



##########
r/R/csv.R:
##########
@@ -510,32 +510,55 @@ CsvReadOptions$create <- function(use_threads = option_use_threads(),
   options
 }
 
-readr_to_csv_write_options <- function(include_header = TRUE,
+readr_to_csv_write_options <- function(col_names = TRUE,
                                        batch_size = 1024L,
-                                       na = "") {
+                                       delim = ",",
+                                       na = "",
+                                       eol = "\n",
+                                       quote = "Needed") {
   CsvWriteOptions$create(
-    include_header = include_header,
+    include_header = col_names,
     batch_size = batch_size,
-    null_string = na
+    delimiter = delim,
+    null_string = na,
+    eol = eol,
+    quoting_style = quote
   )
 }
 
 #' @rdname CsvReadOptions
 #' @export
 CsvWriteOptions <- R6Class("CsvWriteOptions", inherit = ArrowObject)
-CsvWriteOptions$create <- function(include_header = TRUE, batch_size = 1024L, null_string = "") {
-  assert_that(is_integerish(batch_size, n = 1, finite = TRUE), batch_size > 0)
+CsvWriteOptions$create <- function(include_header = TRUE,
+                                   batch_size = 1024L,
+                                   null_string = "",
+                                   delimiter = ",",
+                                   eol = "\n",
+                                   quoting_style = c("Needed", "AllValid", 
"None")) {
+  if (missing(quoting_style)) {
+    quoting_style <- "Needed"
+  }
+  quoting_style_opts <- c("Needed", "AllValid", "None")
+
   assert_that(is.logical(include_header))
+  assert_that(is_integerish(batch_size, n = 1, finite = TRUE), batch_size > 0)
+  assert_that(is.character(delimiter))
   assert_that(is.character(null_string))
   assert_that(!is.na(null_string))
   assert_that(length(null_string) == 1)
   assert_that(!grepl('"', null_string), msg = "na argument must not contain 
quote characters.")
+  assert_that(is.character(eol))
+  assert_that(quoting_style %in% quoting_style_opts, msg = "quoting_style must be one of 'Needed', 'AllValid' or 'None'")

Review Comment:
   Once you use `match.arg()` above, you won't need this check
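
   For reference, `match.arg()` already fails informatively on an invalid value, with an error along the lines of:

   ```r
   match.arg("Maybe", c("Needed", "AllValid", "None"))
   #> Error in match.arg("Maybe", c("Needed", "AllValid", "None")) :
   #>   'arg' should be one of "Needed", "AllValid", "None"
   ```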


