[GitHub] [arrow] nealrichardson commented on a change in pull request #10190: ARROW-11515: [R] Bindings for strsplit

GitBox Fri, 07 May 2021 08:35:48 -0700


nealrichardson commented on a change in pull request #10190:
URL: https://github.com/apache/arrow/pull/10190#discussion_r628314240




##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -239,7 +254,172 @@ test_that("str_replace and str_replace_all", {
 
 })
 
-test_that("backreferences in pattern", {
+test_that("strsplit and str_split", {
+
+  df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and.*", fixed = TRUE)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and", n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, fixed("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, regex("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+
+})
+
+test_that("arrow_*_split_whitespace functions", {
+
+  # use only ASCII whitespace characters
+  df_ascii <- tibble(x = c("Foo\nand bar", "baz\tand qux and quux"))
+
+  # use only non-ASCII whitespace characters
+  df_utf8 <- tibble(x = c("Foo\u00A0and\u2000bar", 
"baz\u2006and\u1680qux\u3000and\u2008quux"))
+
+  df_split <- tibble(x = list(c("Foo", "and", "bar"), c("baz", "and", "qux", 
"and", "quux")))
+
+  expect_equivalent(
+    df_ascii %>%
+      Table$create() %>%
+      mutate(x = arrow_ascii_split_whitespace(x)) %>%
+      collect(),
+    df_split
+  )
+  expect_equivalent(
+    df_utf8 %>%
+      Table$create() %>%
+      mutate(x = arrow_utf8_split_whitespace(x)) %>%
+      collect(),
+    df_split
+  )
+
+})
+
+test_that("errors and warnings in string splitting", {
+  df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+  # These conditions generate an error, but abandon_ship() catches the error,
+  # issues a warning, and pulls the data into R
+  expect_warning(
+    df %>%
+      Table$create() %>%
+      mutate(x = strsplit(x, "and.*", fixed = FALSE)) %>%
+      collect(),
+    regexp = "not supported"

Review comment:
       "not supported" does not uniquely tell you which validation is catching 
this (assuming you're testing different inputs to test each of the validations 
you added). Since you're not asserting anything about the actual output, and 
we're testing elsewhere that `abandon_ship`, and `mutate` for that matter, 
work, maybe these should be more like unit tests. 
   
   Something like: 
   
   ```r
   x <- Expression$field_ref("x")
   expect_warning(
     dplyr_functions$dataset$strsplit(x, "and.*", fixed = FALSE),
     "The full message"
   )
   ```

##########
File path: r/R/dplyr.R
##########
@@ -539,6 +541,44 @@ arrow_stringr_string_replace_function <- function(FUN, 
max_replacements) {
   }
 }
 
+arrow_r_string_split_function <- function(FUN, reverse = FALSE, max_splits = 
-1) {
+  function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) {
+    
+    assert_that(is.string(split))
+    
+    # if !fixed but no regex metachars in split pattern, allow to proceed as 
split isn't regex
+    if (!fixed && contains_regex(split)) {
+      stop("Regular expression matching not supported in strsplit for Arrow", 
call. = FALSE)
+    }
+    if (fixed && perl) {
+      warning("Argument 'perl = TRUE' will be ignored", call. = FALSE)
+    }
+    FUN("split_pattern", x, options = list(pattern = split, reverse = reverse, 
max_splits = max_splits))
+  }
+}
+
+arrow_stringr_string_split_function <- function(FUN, reverse = FALSE) {
+  function(string, pattern, n = Inf, simplify = FALSE) {
+    opts <- get_stringr_pattern_options(enexpr(pattern))
+    if (!opts$fixed && contains_regex(opts$pattern)) {
+      stop("Regular expression matching not supported in str_split() for 
Arrow", call. = FALSE)
+    }
+    if (opts$ignore_case) {
+      stop("Case-insensitive string splitting not supported in Arrow", call. = 
FALSE)
+    }
+    if (n == 0) {
+      stop("Splitting strings into zero parts not supported in Arrow" , call. 
= FALSE)
+    }
+    if (identical(n, Inf)) {
+      n <- 0L
+    }
+    if (simplify) {
+      warning("Argument 'simplify = TRUE' will be ignored", call. = FALSE)
+    }
+    FUN("split_pattern", string, options = list(pattern = opts$pattern, 
reverse = reverse, max_splits = n - 1L))

Review comment:
       Are we sure that this max_splits should be -1? Like, if I want max 2 
splits, I should provide `max_splits = 1`? That sounds quite wrong.
   
   Can you point me at the tests for this behavior? (I don't think there are 
tests of this from R in this PR, but there should be C++ tests.)

##########
File path: r/R/dplyr.R
##########
@@ -539,6 +541,44 @@ arrow_stringr_string_replace_function <- function(FUN, 
max_replacements) {
   }
 }
 
+arrow_r_string_split_function <- function(FUN, reverse = FALSE, max_splits = 
-1) {
+  function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) {
+    
+    assert_that(is.string(split))
+    
+    # if !fixed but no regex metachars in split pattern, allow to proceed as 
split isn't regex
+    if (!fixed && contains_regex(split)) {
+      stop("Regular expression matching not supported in strsplit for Arrow", 
call. = FALSE)
+    }
+    if (fixed && perl) {

Review comment:
       Do we only ignore the `perl` argument if `fixed = TRUE`? 

##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -239,7 +254,172 @@ test_that("str_replace and str_replace_all", {
 
 })
 
-test_that("backreferences in pattern", {
+test_that("strsplit and str_split", {
+
+  df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = strsplit(x, "and.*", fixed = TRUE)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and")) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, "and", n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, fixed("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+  expect_dplyr_equal(
+    input %>%
+      mutate(x = str_split(x, regex("and"), n = 2)) %>%
+      collect(),
+    df
+  )
+
+})
+
+test_that("arrow_*_split_whitespace functions", {
+
+  # use only ASCII whitespace characters
+  df_ascii <- tibble(x = c("Foo\nand bar", "baz\tand qux and quux"))
+
+  # use only non-ASCII whitespace characters
+  df_utf8 <- tibble(x = c("Foo\u00A0and\u2000bar", 
"baz\u2006and\u1680qux\u3000and\u2008quux"))
+
+  df_split <- tibble(x = list(c("Foo", "and", "bar"), c("baz", "and", "qux", 
"and", "quux")))
+
+  expect_equivalent(
+    df_ascii %>%
+      Table$create() %>%
+      mutate(x = arrow_ascii_split_whitespace(x)) %>%

Review comment:
       I think the only reason you'd want to test these functions in the R 
package is to test that you're passing the extra C++ options through (that 
you've wired that up in compute.cpp correctly). So let's test them.

##########
File path: r/src/compute.cpp
##########
@@ -233,6 +233,33 @@ std::shared_ptr<arrow::compute::FunctionOptions> 
make_compute_options(
                                      max_replacements);
   }
 
+  if (func_name == "split_pattern") {
+    using Options = arrow::compute::SplitPatternOptions;
+    int64_t max_splits = -1;

Review comment:
       Rather than setting defaults manually, can you do `auto out = 
std::make_shared<Options>(Options::Defaults());` and then just set them if 
they're provided? (We do this above for other types.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] nealrichardson commented on a change in pull request #10190: ARROW-11515: [R] Bindings for strsplit

Reply via email to