[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

GitBox Sat, 09 Jan 2021 20:04:45 -0800


nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554241301




##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       If I read the code correctly, we're not distinguishing between arguments 
that are valid CsvReadOptions/CsvConvertOptions and just not supported in 
datasets yet (what ARROW-8631 is about) and plain garbage `asdfasdrfwea` 
options. Is that distinction worth making? I.e. is erroring with `Unsupported 
option: na` clear enough that we're not going to get another bug report about 
this?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Why does this need to be asserted? Because the validating code assumes 
it?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       You don't have to use it here, but for reference we have an 
`oxford_paste` utility function for making human-readable, appropriately 
comma'd error message strings: 
https://github.com/apache/arrow/blob/master/r/R/util.R#L18

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Could you add a comment here explaining how this catches ambiguous args? 
I don't use `pmatch` enough for this to be immediately apparent. 

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Ok, that's what I figured. Could you please add a comment here 
explaining that?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       I don't think we do, though I've definitely written various `pluralize` 
util functions in the past, and we could add one here (if a dependency doesn't 
already provide one; we should check `asserthat` and `rlang`)

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = 
csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", 
"skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Got it. So this works because we've already filtered out the case where 
there's no match in each list separately, so if there's no unambiguous match 
when together, it must mean that there were multiple matches.
   
   Can you add an inline comment to that effect? 

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Yeah, I think we need to distinguish in the error message between "your 
fault" and "our fault/not implemented"--what you as a user do with that 
information is very different.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Reply via email to