ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554271178
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
Review comment:
The code in `csv_file_format_parse_options()` in `dataset-format.R` will
error or behave incorrectly if any of the Arrow file parsing options
(`delimiter`, `quoting`, etc.) happens to have exactly the same name as any of
the supported readr-style options (`delim`, `quote`, etc.). Some future change
might accidentally cause these two lists of option names to overlap, so I added
this test which will fail if that happens.
##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts =
csv_file_format_parse_options(...))
}
csv_file_format_parse_options <- function(...) {
+ opt_names <- names(list(...))
# Support both the readr spelling of options and the arrow spelling
- readr_opts <- c("delim", "quote", "escape_double", "escape_backslash",
"skip_empty_rows")
- if (any(readr_opts %in% names(list(...)))) {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+ is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+ is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+ bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+ if (length(bad_opts)) {
+ stop("Unsupported options: ",
+ paste(bad_opts, collapse = ", "),
+ call. = FALSE)
+ }
+ is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))
Review comment:
`arrow_opts` is a vector of the names of the allowed arguments to
`CsvParseOptions$create()` (Arrow-style arguments). `readr_opts` is a vector of
the names of the allowed arguments to `readr_to_csv_parse_options()`
(readr-style arguments).
```r
arrow_opts
## [1] "delimiter" "quoting" "quote_char"
"double_quote" "escaping"
## [6] "escape_char" "newlines_in_values" "ignore_empty_lines"
readr_opts
## [1] "delim" "quote" "escape_double"
"escape_backslash" "skip_empty_rows"
```
The function we're inside here (`csv_file_format_parse_options()`) allows
_either_ of these sets of arguments. These two sets of argument names are
mutually exclusive, but R's partial matching of argument names throws a wrench
in that. For example, if someone shortens either `delimiter` or `delim` to just
`del`, that would work fine in a function that accepts _only_ Arrow-style
arguments or _only_ readr-style options, but here it creates ambiguity—we can't
tell if the user is intending to specify Arrow-style arguments or readr-style
arguments.
```r
open_dataset("/path/to/csv/", format = "csv", del = ";")
## đź’€
```
So `pmatch()` to the rescue. `pmatch()`, when called like it is here, uses
the same algorithm for partial matching that R uses to identify named arguments
in function calls. `pmatch(x, y)` returns a vector of the same length as `x`,
and in each position, the value will be `NA` if and only if the character
string in that position in `x` _cannot_ be unambiguously matched to exactly one
character string in `y`. So if there are any `NA` values in the vector returned
by `pmatch()`, that means at least one of the argument names is ambiguous.
##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts =
csv_file_format_parse_options(...))
}
csv_file_format_parse_options <- function(...) {
+ opt_names <- names(list(...))
# Support both the readr spelling of options and the arrow spelling
- readr_opts <- c("delim", "quote", "escape_double", "escape_backslash",
"skip_empty_rows")
- if (any(readr_opts %in% names(list(...)))) {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+ is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+ is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+ bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+ if (length(bad_opts)) {
+ stop("Unsupported options: ",
+ paste(bad_opts, collapse = ", "),
Review comment:
Oh nice. Do we have any go-to functions for handling singular/plural (so
that the "s" in "options" could be conditionally excluded/included)?
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
+ intersect(arrow_opts, readr_opts),
+ character(0)
+ )
- # Now with readr option spelling (and omitting format = "text")
- ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+ # With unsupported readr parse options
+ # (remove this after ARROW-8631)
+ if (!"na" %in% readr_opts) {
Review comment:
Right. If you think it's worth it, I'll add some additional code here to
get the full list of rlang options that `read_delim_arrow()`supports and
compute a set difference with the names of the rlang options currently
supported here. Then in the case where the user specifies one of those options,
I'll give a more specific error message.
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
filter(integer > 6) %>%
summarize(mean = mean(integer))
)
+})
+
+test_that("readr parse options", {
+ arrow_opts <- names(formals(CsvParseOptions$create))
+ readr_opts <- names(formals(readr_to_csv_parse_options))
+
+ # Arrow and readr options are mutually exclusive
+ expect_equal(
+ intersect(arrow_opts, readr_opts),
+ character(0)
+ )
- # Now with readr option spelling (and omitting format = "text")
- ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+ # With unsupported readr parse options
+ # (remove this after ARROW-8631)
+ if (!"na" %in% readr_opts) {
Review comment:
Right. If you think it's worth it, I'll add some additional code here to
get the full list of rlang options that `read_delim_arrow()` supports and
compute a set difference with the names of the rlang options currently
supported here. Then in the case where the user specifies one of those options,
I'll give a more specific error message.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]