[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

GitBox Sat, 09 Jan 2021 20:19:12 -0800


ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554271178




##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       The code in `csv_file_format_parse_options()` in `dataset-format.R` will 
error or behave incorrectly if any of the Arrow file parsing options 
(`delimiter`, `quoting`, etc.) happens to have exactly the same name as any of 
the supported readr-style options (`delim`, `quote`, etc.). Some future change 
might accidentally cause these two lists of option names to overlap, so I added 
this test which will fail if that happens.

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       `arrow_opts` is a vector of the names of the allowed arguments to 
`CsvParseOptions$create()` (Arrow-style arguments). `readr_opts` is a vector of 
the names of the allowed arguments to `readr_to_csv_parse_options()` 
(readr-style arguments).
   ```r
   arrow_opts
   ## [1] "delimiter"          "quoting"            "quote_char"         
"double_quote"       "escaping"          
   ## [6] "escape_char"        "newlines_in_values" "ignore_empty_lines"
   
   readr_opts
   ## [1] "delim"            "quote"            "escape_double"    
"escape_backslash" "skip_empty_rows"
   ```
   The function we're inside here (`csv_file_format_parse_options()`) allows 
_either_ of these sets of arguments. These two sets of argument names are 
mutually exclusive, but R's partial matching of argument names throws a wrench 
in that. For example, if someone shortens either `delimiter` or `delim` to just 
`del`, that would work fine in a function that accepts _only_ Arrow-style 
arguments or _only_ readr-style options, but here it creates ambiguity—we can't 
tell if the user is intending to specify Arrow-style arguments or readr-style 
arguments.
   
   ```r
   open_dataset("/path/to/csv/", format = "csv", del = ";")
   ## 💀
   ```
   So `pmatch()` to the rescue. `pmatch()`, when called like it is here, uses 
the same algorithm for partial matching that R uses to identify named arguments 
in function calls. `pmatch(x, y)` returns a vector of the same length as `x`, 
and in each position, the value will be `NA` if and only if the character 
string in that position in `x` _cannot_ be unambiguously matched to exactly one 
character string in `y`. So if there are any `NA` values in the vector returned 
by `pmatch()`, that means at least one of the argument names is ambiguous.

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       Oh nice. Do we have any go-to functions for handling singular/plural (so 
that the "s" in "options" could be conditionally excluded/included)?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to 
get the full list of rlang options that `read_delim_arrow()`supports and 
compute a set difference with the names of the rlang options currently 
supported here. Then in the case where the user specifies one of those options, 
I'll give a more specific error message.

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to 
get the full list of rlang options that `read_delim_arrow()` supports and 
compute a set difference with the names of the rlang options currently 
supported here. Then in the case where the user specifies one of those options, 
I'll give a more specific error message.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Reply via email to