This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 37f62d0bc5 GH-43960: [R] fix `str_sub` binding to properly handle
negative `end` values (#44141)
37f62d0bc5 is described below
commit 37f62d0bc5f4d22e7194947963b445225b984558
Author: Stephen Coussens <[email protected]>
AuthorDate: Sat Sep 21 03:05:10 2024 -0700
GH-43960: [R] fix `str_sub` binding to properly handle negative `end`
values (#44141)
First-time contributor here, so let me know where I can improve!
### Rationale for this change
The `str_sub` binding in arrow was not handling negative `end` values
properly. The problem was two-fold:
1. When `end` values were negative (and less than the `start` value,
which might be positive), `str_sub` would improperly return an empty string.
2. When `end` values were < -1 but the `end` position was still to the
right of the `start` position, `str_sub` failed to return the final character
in the substring, since it did not account for the fact that `end` is counted
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but
inclusively in R.
See discussion/examples at https://github.com/apache/arrow/issues/43960 for
details.
### What changes are included in this PR?
1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set
`end`= 0 when `start < end`, which meant if the user was counting backwards
from the end of the string (with a negative `end` value), an empty string would
[wrongly] be returned. It appears that the case that the previous code was
trying to address is already handled properly by the underlying C++ function
(`utf8_slice_codeunits`).
2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the
difference in between R's inclusive `end` and C++'s exclusive `end` when `end`
is negative.
3. The addition of a test (described below) to
`r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.
### Are these changes tested?
Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`,
including one which I added (see attached commit), which explicitly tests the
case where `end` is negative (-3) and less than the `start` value (1). This
also tests the case where `end` < -1 and to the right of the `start` position.
### Are there any user-facing changes?
No.
**This PR contains a "Critical Fix".** Previously:
- When `end` values were negative (and less than the `start` value, which
might be positive), `str_sub` would improperly return an empty string.
- When `end` values were < -1 but the `end` position was still to the right
of the `start` position, `str_sub` failed to return the final character in the
substring, since it did not account for the fact that `end` is counted
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but
inclusively in R.
* GitHub Issue: #43960
Lead-authored-by: Stephen Coussens <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
---
r/R/dplyr-funcs-string.R | 10 ++++++----
r/tests/testthat/test-dplyr-funcs-string.R | 9 +++++++++
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/r/R/dplyr-funcs-string.R b/r/R/dplyr-funcs-string.R
index 77e1a5405a..28db78f609 100644
--- a/r/R/dplyr-funcs-string.R
+++ b/r/R/dplyr-funcs-string.R
@@ -570,10 +570,12 @@ register_bindings_string_other <- function() {
end <- .Machine$integer.max
}
- # An end value lower than a start value returns an empty string in
- # stringr::str_sub so set end to 0 here to match this behavior
- if (end < start) {
- end <- 0
+ # strings returned by utf8_slice_codeunits are exclusive of the `end`
position.
+ # stringr::str_sub returns strings inclusive of the `end` position, so add
1 to `end`.
+ # NOTE:this isn't necessary for positive values of `end`, because
utf8_slice_codeunits
+ # is 0-based while R is 1-based, which cancels out the effect of the
exclusive `end`
+ if (end < -1) {
+ end <- end + 1L
}
# subtract 1 from `start` because C++ is 0-based and R is 1-based
diff --git a/r/tests/testthat/test-dplyr-funcs-string.R
b/r/tests/testthat/test-dplyr-funcs-string.R
index cb1d467505..86966b3053 100644
--- a/r/tests/testthat/test-dplyr-funcs-string.R
+++ b/r/tests/testthat/test-dplyr-funcs-string.R
@@ -1178,6 +1178,15 @@ test_that("str_sub", {
collect(),
df
)
+ compare_dplyr_binding(
+ .input %>%
+ mutate(
+ y = str_sub(x, 1, -3),
+ y2 = stringr::str_sub(x, 1, -3)
+ ) %>%
+ collect(),
+ df
+ )
expect_arrow_eval_error(
str_sub("Apache Arrow", c(1, 2), 3),