jonkeane commented on a change in pull request #11232:
URL: https://github.com/apache/arrow/pull/11232#discussion_r718651282



##########
File path: r/R/dplyr-functions.R
##########
@@ -330,6 +330,39 @@ arrow_string_join_function <- function(null_handling, 
null_replacement = NULL) {
   }
 }
 
+# Currently, Arrow does not supports a locale option for string case conversion
+# functions, contrast to stringr's API, so the 'locale' argument is only valid
+# for stringr's default value ("en"). The following are string functions that
+# take a 'locale' option as its second argument:
+#   str_to_lower
+#   str_to_upper
+#   str_to_title
+#
+# Arrow locale will be supported with ARROW-14126
+nse_funcs$str_to_lower <- function(string, locale = "en") {
+  if (!identical(locale, "en")) {
+    stop("Providing 'locale' to 'str_to_lower' is not supported in Arrow; ",
+    "to change locale use 'Sys.setlocale()'", call. = FALSE)

Review comment:
       I like Nic's phrasing here, seeing it repeated three times: should we 
write/store the message once and use it in each `stop` so we know the same is 
being used. If we do want to keep the function name in it (I don't think we 
need to, but if we do) we could make a small helper that plots that in.

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -471,6 +471,46 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+  df <- tibble(x = c("Foo", " B\na R", "ⱭɽⱤoW", "ıI"))

Review comment:
       I've triggered our as-cran build to check this — we might want (or in 
fact need) to turn these non-ascii characters into their unicode escapes to not 
anger cran checking (cf 
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-json.R#L26)

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -471,6 +471,46 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+  df <- tibble(x = c("Foo", " B\na R", "ⱭɽⱤoW", "ıI"))

Review comment:
       It did pass, though 
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Encoding-issues 
recommends using `\u...` for portability. 

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -467,6 +467,25 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+  df <- tibble(x = c("1Foo1", " \tB a R\n", "!apACHe aRroW!"))
+  funcs <- c(str_to_lower, str_to_upper, str_to_title)
+  for (func in funcs) {
+    expect_dplyr_equal(
+      input %>%
+        transmute(x = func(x)) %>%
+        collect(),
+      df
+    )
+
+    funcname = as.character(substitute(func))
+    expect_error(
+      nse_funcs[[funcname]]("Apache Arrow", locale = "sp"),
+      "Providing a value for 'locale' other than the default ('en') is not 
supported by Arrow"
+    )

Review comment:
       We could probably get away with testing the error on one function and 
not wrapping them all in this for loop.

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -467,6 +467,25 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+  df <- tibble(x = c("1Foo1", " \tB a R\n", "!apACHe aRroW!"))
+  funcs <- c(str_to_lower, str_to_upper, str_to_title)
+  for (func in funcs) {
+    expect_dplyr_equal(
+      input %>%
+        transmute(x = func(x)) %>%

Review comment:
       This should work, but because of how these tests are evaluated, I think 
something is getting lost leading to these errors: 
https://github.com/apache/arrow/pull/11232/checks?check_run_id=3748190500#step:8:17492

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -467,6 +467,25 @@ test_that("strsplit and str_split", {
   )
 })
 
+test_that("str_to_lower, str_to_upper, and str_to_title", {
+  df <- tibble(x = c("1Foo1", " \tB a R\n", "!apACHe aRroW!"))
+  funcs <- c(str_to_lower, str_to_upper, str_to_title)
+  for (func in funcs) {
+    expect_dplyr_equal(
+      input %>%
+        transmute(x = func(x)) %>%

Review comment:
       In other places, we have tested lists of functions like these by 
transmuting/mutating multiple columns 
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dplyr.R#L996-L1009
   
   something like:
   
   ```
   expect_dplyr_equal(
          input %>%
            transmute(
              x_lower = str_to_lower(x),
              x_upper = str_to_upper(x),
              x_title = str_to_title(x)
            ) %>%
            collect()
   ```

##########
File path: r/R/dplyr-functions.R
##########
@@ -330,6 +330,35 @@ arrow_string_join_function <- function(null_handling, 
null_replacement = NULL) {
   }
 }
 
+# Currently, Arrow does not supports a locale option for string case conversion
+# functions, contrast to stringr's API, so the 'locale' argument is only valid
+# for stringr's default value ("en"). The following are string functions that
+# take a 'locale' option as its second argument:
+#   str_to_lower
+#   str_to_upper
+#   str_to_title
+#
+# Arrow locale will be supported with ARROW-14126
+.arrow_string_function_with_locale_arg <- function(func, string, locale) {
+  if (!identical(locale, "en")) {
+    stop("Providing a value for 'locale' other than the default ('en') is not 
supported by Arrow. ",
+    "To change locale, use 'Sys.setlocale()'", call. = FALSE)
+  }
+  Expression$create(func, string)
+}
+
+nse_funcs$str_to_lower <- function(string, locale = "en") {
+  .arrow_string_function_with_locale_arg("utf8_lower", string, locale)
+}
+
+nse_funcs$str_to_upper <- function(string, locale = "en") {
+  .arrow_string_function_with_locale_arg("utf8_upper", string, locale)
+}
+
+nse_funcs$str_to_title <- function(string, locale = "en") {
+  .arrow_string_function_with_locale_arg("utf8_title", string, locale)
+}

Review comment:
       Yes I agree, this also fits with other patterns we have of 
stopping/bailing with validation function and then producing what's needed 
otherwise.

##########
File path: r/R/dplyr-functions.R
##########
@@ -330,6 +330,35 @@ arrow_string_join_function <- function(null_handling, 
null_replacement = NULL) {
   }
 }
 
+# Currently, Arrow does not supports a locale option for string case conversion
+# functions, contrast to stringr's API, so the 'locale' argument is only valid
+# for stringr's default value ("en"). The following are string functions that
+# take a 'locale' option as its second argument:
+#   str_to_lower
+#   str_to_upper
+#   str_to_title
+#
+# Arrow locale will be supported with ARROW-14126
+.arrow_string_function_with_locale_arg <- function(func, string, locale) {
+  if (!identical(locale, "en")) {
+    stop("Providing a value for 'locale' other than the default ('en') is not 
supported by Arrow. ",
+    "To change locale, use 'Sys.setlocale()'", call. = FALSE)
+  }
+  Expression$create(func, string)
+}
+
+nse_funcs$str_to_lower <- function(string, locale = "en") {
+  .arrow_string_function_with_locale_arg("utf8_lower", string, locale)
+}
+
+nse_funcs$str_to_upper <- function(string, locale = "en") {
+  .arrow_string_function_with_locale_arg("utf8_upper", string, locale)
+}
+
+nse_funcs$str_to_title <- function(string, locale = "en") {
+  .arrow_string_function_with_locale_arg("utf8_title", string, locale)
+}

Review comment:
       It's already hidden in so far as it's not available to users of the 
arrow package (without doing something special). I think the way it is is good 
— no reason to prefix it with a dot




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to