nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554241301
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
+ intersect(arrow_opts, readr_opts),
+ character(0)
+ )
- # Now with readr option spelling (and omitting format = "text")
- ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+ # With unsupported readr parse options
+ # (remove this after ARROW-8631)
+ if (!"na" %in% readr_opts) {
Review comment:
If I read the code correctly, we're not distinguishing between arguments
that are valid CsvReadOptions/CsvConvertOptions and just not supported in
datasets yet (what ARROW-8631 is about) and plain garbage `asdfasdrfwea`
options. Is that distinction worth making? I.e. is erroring with `Unsupported
option: na` clear enough that we're not going to get another bug report about
this?
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
Review comment:
Why does this need to be asserted? Because the validating code assumes
it?
##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts =
csv_file_format_parse_options(...))
}
csv_file_format_parse_options <- function(...) {
+ opt_names <- names(list(...))
# Support both the readr spelling of options and the arrow spelling
- readr_opts <- c("delim", "quote", "escape_double", "escape_backslash",
"skip_empty_rows")
- if (any(readr_opts %in% names(list(...)))) {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+ is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+ is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+ bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+ if (length(bad_opts)) {
+ stop("Unsupported options: ",
+ paste(bad_opts, collapse = ", "),
Review comment:
You don't have to use it here, but for reference we have an
`oxford_paste` utility function for making human-readable, appropriately
comma'd error message strings:
https://github.com/apache/arrow/blob/master/r/R/util.R#L18
##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts =
csv_file_format_parse_options(...))
}
csv_file_format_parse_options <- function(...) {
+ opt_names <- names(list(...))
# Support both the readr spelling of options and the arrow spelling
- readr_opts <- c("delim", "quote", "escape_double", "escape_backslash",
"skip_empty_rows")
- if (any(readr_opts %in% names(list(...)))) {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+ is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+ is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+ bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+ if (length(bad_opts)) {
+ stop("Unsupported options: ",
+ paste(bad_opts, collapse = ", "),
+ call. = FALSE)
+ }
+ is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))
Review comment:
Could you add a comment here explaining how this catches ambiguous args?
I don't use `pmatch` enough for this to be immediately apparent.
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
Review comment:
Ok, that's what I figured. Could you please add a comment here
explaining that?
##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts =
csv_file_format_parse_options(...))
}
csv_file_format_parse_options <- function(...) {
+ opt_names <- names(list(...))
# Support both the readr spelling of options and the arrow spelling
- readr_opts <- c("delim", "quote", "escape_double", "escape_backslash",
"skip_empty_rows")
- if (any(readr_opts %in% names(list(...)))) {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+ is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+ is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+ bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+ if (length(bad_opts)) {
+ stop("Unsupported options: ",
+ paste(bad_opts, collapse = ", "),
Review comment:
I don't think we do, though I've definitely written various `pluralize`
util functions in the past, and we could add one here (if a dependency doesn't
already provide one; we should check `asserthat` and `rlang`)
##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts =
csv_file_format_parse_options(...))
}
csv_file_format_parse_options <- function(...) {
+ opt_names <- names(list(...))
# Support both the readr spelling of options and the arrow spelling
- readr_opts <- c("delim", "quote", "escape_double", "escape_backslash",
"skip_empty_rows")
- if (any(readr_opts %in% names(list(...)))) {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+ is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+ is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+ bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+ if (length(bad_opts)) {
+ stop("Unsupported options: ",
+ paste(bad_opts, collapse = ", "),
+ call. = FALSE)
+ }
+ is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))
Review comment:
Got it. So this works because we've already filtered out the case where
there's no match in each list separately, so if there's no unambiguous match
when together, it must mean that there were multiple matches.
Can you add an inline comment to that effect?
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
+ intersect(arrow_opts, readr_opts),
+ character(0)
+ )
- # Now with readr option spelling (and omitting format = "text")
- ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+ # With unsupported readr parse options
+ # (remove this after ARROW-8631)
+ if (!"na" %in% readr_opts) {
Review comment:
Yeah, I think we need to distinguish in the error message between "your
fault" and "our fault/not implemented"--what you as a user do with that
information is very different.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]