jonkeane commented on a change in pull request #11232:
URL: https://github.com/apache/arrow/pull/11232#discussion_r718651282
##########
File path: r/R/dplyr-functions.R
##########
@@ -330,6 +330,39 @@ arrow_string_join_function <- function(null_handling,
null_replacement = NULL) {
}
}
+# Currently, Arrow does not supports a locale option for string case conversion
+# functions, contrast to stringr's API, so the 'locale' argument is only valid
+# for stringr's default value ("en"). The following are string functions that
+# take a 'locale' option as its second argument:
+# str_to_lower
+# str_to_upper
+# str_to_title
+#
+# Arrow locale will be supported with ARROW-14126
+nse_funcs$str_to_lower <- function(string, locale = "en") {
+ if (!identical(locale, "en")) {
+ stop("Providing 'locale' to 'str_to_lower' is not supported in Arrow; ",
+ "to change locale use 'Sys.setlocale()'", call. = FALSE)
Review comment:
I like Nic's phrasing here, seeing it repeated three times: should we
write/store the message once and use it in each `stop` so we know the same is
being used. If we do want to keep the function name in it (I don't think we
need to, but if we do) we could make a small helper that plots that in.
##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -471,6 +471,46 @@ test_that("strsplit and str_split", {
)
})
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+ df <- tibble(x = c("Foo", " B\na R", "ⱭɽⱤoW", "ıI"))
Review comment:
I've triggered our as-cran build to check this — we might want (or in
fact need) to turn these non-ascii characters into their unicode escapes to not
anger cran checking (cf
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-json.R#L26)
##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -471,6 +471,46 @@ test_that("strsplit and str_split", {
)
})
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+ df <- tibble(x = c("Foo", " B\na R", "ⱭɽⱤoW", "ıI"))
Review comment:
It did pass, though
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Encoding-issues
recommends using `\u...` for portability.
##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -467,6 +467,25 @@ test_that("strsplit and str_split", {
)
})
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+ df <- tibble(x = c("1Foo1", " \tB a R\n", "!apACHe aRroW!"))
+ funcs <- c(str_to_lower, str_to_upper, str_to_title)
+ for (func in funcs) {
+ expect_dplyr_equal(
+ input %>%
+ transmute(x = func(x)) %>%
+ collect(),
+ df
+ )
+
+ funcname = as.character(substitute(func))
+ expect_error(
+ nse_funcs[[funcname]]("Apache Arrow", locale = "sp"),
+ "Providing a value for 'locale' other than the default ('en') is not
supported by Arrow"
+ )
Review comment:
We could probably get away with testing the error on one function and
not wrapping them all in this for loop.
##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -467,6 +467,25 @@ test_that("strsplit and str_split", {
)
})
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+ df <- tibble(x = c("1Foo1", " \tB a R\n", "!apACHe aRroW!"))
+ funcs <- c(str_to_lower, str_to_upper, str_to_title)
+ for (func in funcs) {
+ expect_dplyr_equal(
+ input %>%
+ transmute(x = func(x)) %>%
Review comment:
This should work, but because of how these tests are evaluated, I think
something is getting lost leading to these errors:
https://github.com/apache/arrow/pull/11232/checks?check_run_id=3748190500#step:8:17492
##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -467,6 +467,25 @@ test_that("strsplit and str_split", {
)
})
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+ df <- tibble(x = c("1Foo1", " \tB a R\n", "!apACHe aRroW!"))
+ funcs <- c(str_to_lower, str_to_upper, str_to_title)
+ for (func in funcs) {
+ expect_dplyr_equal(
+ input %>%
+ transmute(x = func(x)) %>%
Review comment:
In other places, we have tested lists of functions like these by
transmuting/mutating multiple columns
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr.R#L996-L1009
something like:
```
expect_dplyr_equal(
input %>%
transmute(
x_lower = str_to_lower(x),
x_upper = str_to_upper(x),
x_title = str_to_title(x)
) %>%
collect()
```
##########
File path: r/R/dplyr-functions.R
##########
@@ -330,6 +330,35 @@ arrow_string_join_function <- function(null_handling,
null_replacement = NULL) {
}
}
+# Currently, Arrow does not supports a locale option for string case conversion
+# functions, contrast to stringr's API, so the 'locale' argument is only valid
+# for stringr's default value ("en"). The following are string functions that
+# take a 'locale' option as its second argument:
+# str_to_lower
+# str_to_upper
+# str_to_title
+#
+# Arrow locale will be supported with ARROW-14126
+.arrow_string_function_with_locale_arg <- function(func, string, locale) {
+ if (!identical(locale, "en")) {
+ stop("Providing a value for 'locale' other than the default ('en') is not
supported by Arrow. ",
+ "To change locale, use 'Sys.setlocale()'", call. = FALSE)
+ }
+ Expression$create(func, string)
+}
+
+nse_funcs$str_to_lower <- function(string, locale = "en") {
+ .arrow_string_function_with_locale_arg("utf8_lower", string, locale)
+}
+
+nse_funcs$str_to_upper <- function(string, locale = "en") {
+ .arrow_string_function_with_locale_arg("utf8_upper", string, locale)
+}
+
+nse_funcs$str_to_title <- function(string, locale = "en") {
+ .arrow_string_function_with_locale_arg("utf8_title", string, locale)
+}
Review comment:
Yes I agree, this also fits with other patterns we have of
stopping/bailing with validation function and then producing what's needed
otherwise.
##########
File path: r/R/dplyr-functions.R
##########
@@ -330,6 +330,35 @@ arrow_string_join_function <- function(null_handling,
null_replacement = NULL) {
}
}
+# Currently, Arrow does not supports a locale option for string case conversion
+# functions, contrast to stringr's API, so the 'locale' argument is only valid
+# for stringr's default value ("en"). The following are string functions that
+# take a 'locale' option as its second argument:
+# str_to_lower
+# str_to_upper
+# str_to_title
+#
+# Arrow locale will be supported with ARROW-14126
+.arrow_string_function_with_locale_arg <- function(func, string, locale) {
+ if (!identical(locale, "en")) {
+ stop("Providing a value for 'locale' other than the default ('en') is not
supported by Arrow. ",
+ "To change locale, use 'Sys.setlocale()'", call. = FALSE)
+ }
+ Expression$create(func, string)
+}
+
+nse_funcs$str_to_lower <- function(string, locale = "en") {
+ .arrow_string_function_with_locale_arg("utf8_lower", string, locale)
+}
+
+nse_funcs$str_to_upper <- function(string, locale = "en") {
+ .arrow_string_function_with_locale_arg("utf8_upper", string, locale)
+}
+
+nse_funcs$str_to_title <- function(string, locale = "en") {
+ .arrow_string_function_with_locale_arg("utf8_title", string, locale)
+}
Review comment:
It's already hidden in so far as it's not available to users of the
arrow package (without doing something special). I think the way it is is good
— no reason to prefix it with a dot
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]