ianmcook commented on a change in pull request #10190:
URL: https://github.com/apache/arrow/pull/10190#discussion_r628524212
##########
File path: r/tests/testthat/test-dplyr-string-functions.R
##########
@@ -239,7 +254,172 @@ test_that("str_replace and str_replace_all", {
})
-test_that("backreferences in pattern", {
+test_that("strsplit and str_split", {
+
+ df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+ expect_dplyr_equal(
+ input %>%
+ mutate(x = strsplit(x, "and")) %>%
+ collect(),
+ df
+ )
+ expect_dplyr_equal(
+ input %>%
+ mutate(x = strsplit(x, "and.*", fixed = TRUE)) %>%
+ collect(),
+ df
+ )
+ expect_dplyr_equal(
+ input %>%
+ mutate(x = str_split(x, "and")) %>%
+ collect(),
+ df
+ )
+ expect_dplyr_equal(
+ input %>%
+ mutate(x = str_split(x, "and", n = 2)) %>%
+ collect(),
+ df
+ )
+ expect_dplyr_equal(
+ input %>%
+ mutate(x = str_split(x, fixed("and"), n = 2)) %>%
+ collect(),
+ df
+ )
+ expect_dplyr_equal(
+ input %>%
+ mutate(x = str_split(x, regex("and"), n = 2)) %>%
+ collect(),
+ df
+ )
+
+})
+
+test_that("arrow_*_split_whitespace functions", {
+
+ # use only ASCII whitespace characters
+ df_ascii <- tibble(x = c("Foo\nand bar", "baz\tand qux and quux"))
+
+ # use only non-ASCII whitespace characters
+ df_utf8 <- tibble(x = c("Foo\u00A0and\u2000bar",
+ "baz\u2006and\u1680qux\u3000and\u2008quux"))
+
+ df_split <- tibble(x = list(c("Foo", "and", "bar"), c("baz", "and", "qux",
+ "and", "quux")))
+
+ expect_equivalent(
+ df_ascii %>%
+ Table$create() %>%
+ mutate(x = arrow_ascii_split_whitespace(x)) %>%
+ collect(),
+ df_split
+ )
+ expect_equivalent(
+ df_utf8 %>%
+ Table$create() %>%
+ mutate(x = arrow_utf8_split_whitespace(x)) %>%
+ collect(),
+ df_split
+ )
+
+})
+
+test_that("errors and warnings in string splitting", {
+ df <- tibble(x = c("Foo and bar", "baz and qux and quux"))
+
+ # These conditions generate an error, but abandon_ship() catches the error,
+ # issues a warning, and pulls the data into R
+ expect_warning(
+ df %>%
+ Table$create() %>%
+ mutate(x = strsplit(x, "and.*", fixed = FALSE)) %>%
+ collect(),
+ regexp = "not supported"
Review comment:
I agree, but I think this is a bigger problem affecting our dplyr tests,
and I don't think it makes sense to solve it only in the context of this
particular function. Dozens of tests, added across many different PRs,
exhibit the same problem you describe.
Are you OK with me opening a separate Jira, articulating this problem in it,
and starting a discussion there about how best to resolve it? There are
considerations such as: do we intend to keep `abandon_ship` working as it does
even after our dplyr support is much more comprehensive? Could we perhaps
modify `abandon_ship` to incorporate the original error message into the
warning it gives, instead of emitting a generic and likely less helpful
message? I think such questions are better discussed outside the scope of this
one PR.
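
To make the second question concrete, here is a rough sketch of the idea in
plain R. It is not the actual `abandon_ship` implementation: the
`with_fallback` helper, its arguments, and the message wording are all
hypothetical and only illustrate carrying the original error message into the
warning.

```r
# Hypothetical helper (NOT arrow's abandon_ship): try the Arrow path first and,
# if it errors, warn with the original error message and fall back to R.
with_fallback <- function(arrow_fun, r_fun) {
  tryCatch(
    arrow_fun(),
    error = function(e) {
      warning(
        "Expression not supported in Arrow; pulling data into R. ",
        "Original error: ", conditionMessage(e),
        call. = FALSE
      )
      r_fun()
    }
  )
}

# Capture the warning text so it can be inspected, then muffle the warning.
captured <- NULL
result <- withCallingHandlers(
  with_fallback(
    # Stand-in for an Arrow expression that fails (assumed message text)
    function() stop("regular expression matching not supported in strsplit"),
    # Stand-in for the R fallback on already-collected data
    function() strsplit("Foo and bar", "and.*")
  ),
  warning = function(w) {
    captured <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)
```

With something like this, a test could match `regexp` against the specific
underlying error text rather than a generic "not supported" string.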