This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 37f62d0bc5 GH-43960: [R] fix `str_sub` binding to properly handle 
negative `end` values (#44141)
37f62d0bc5 is described below

commit 37f62d0bc5f4d22e7194947963b445225b984558
Author: Stephen Coussens <[email protected]>
AuthorDate: Sat Sep 21 03:05:10 2024 -0700

    GH-43960: [R] fix `str_sub` binding to properly handle negative `end` 
values (#44141)
    
    First-time contributor here, so let me know where I can improve!
    
    ### Rationale for this change
    
    The `str_sub` binding in arrow was not handling negative `end` values 
properly. The problem was two-fold:
     1.  When `end` values were negative (and less than the `start` value, 
which might be positive), `str_sub` would improperly return an empty string.
    2.  When `end` values were < -1 but the `end` position was still to the 
right of the `start` position, `str_sub` failed to return the final character 
in the substring, since it did not account for the fact that `end` is counted 
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but 
inclusively in R.
    
    See discussion/examples at https://github.com/apache/arrow/issues/43960 for 
details.
    
    ### What changes are included in this PR?
    1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set 
`end`= 0 when `start < end`, which meant if the user was counting backwards 
from the end of the string (with a negative `end` value), an empty string would 
[wrongly] be returned. It appears that the case that the previous code was 
trying to address is already handled properly by the underlying C++ function 
(`utf8_slice_codeunits`).
    2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the 
difference in between R's inclusive `end` and C++'s exclusive `end` when `end` 
is negative.
    3. The addition of a test (described below) to 
`r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.
    
    ### Are these changes tested?
    
    Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, 
including one which I added (see attached commit), which explicitly tests the 
case where `end` is negative (-3) and less than the `start` value (1). This 
also tests the case where `end`  < -1 and to the right of the `start` position.
    
    ### Are there any user-facing changes?
    
    No.
    
    **This PR contains a "Critical Fix".** Previously:
    - When `end` values were negative (and less than the `start` value, which 
might be positive), `str_sub` would improperly return an empty string.
    - When `end` values were < -1 but the `end` position was still to the right 
of the `start` position, `str_sub` failed to return the final character in the 
substring, since it did not account for the fact that `end` is counted 
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but 
inclusively in R.
    * GitHub Issue: #43960
    
    Lead-authored-by: Stephen Coussens <[email protected]>
    Co-authored-by: Nic Crane <[email protected]>
    Signed-off-by: Nic Crane <[email protected]>
---
 r/R/dplyr-funcs-string.R                   | 10 ++++++----
 r/tests/testthat/test-dplyr-funcs-string.R |  9 +++++++++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/r/R/dplyr-funcs-string.R b/r/R/dplyr-funcs-string.R
index 77e1a5405a..28db78f609 100644
--- a/r/R/dplyr-funcs-string.R
+++ b/r/R/dplyr-funcs-string.R
@@ -570,10 +570,12 @@ register_bindings_string_other <- function() {
       end <- .Machine$integer.max
     }
 
-    # An end value lower than a start value returns an empty string in
-    # stringr::str_sub so set end to 0 here to match this behavior
-    if (end < start) {
-      end <- 0
+    # strings returned by utf8_slice_codeunits are exclusive of the `end` 
position.
+    # stringr::str_sub returns strings inclusive of the `end` position, so add 
1 to `end`.
+    # NOTE:this isn't necessary for positive values of `end`, because 
utf8_slice_codeunits
+    # is 0-based while R is 1-based, which cancels out the effect of the 
exclusive `end`
+    if (end < -1) {
+      end <- end + 1L
     }
 
     # subtract 1 from `start` because C++ is 0-based and R is 1-based
diff --git a/r/tests/testthat/test-dplyr-funcs-string.R 
b/r/tests/testthat/test-dplyr-funcs-string.R
index cb1d467505..86966b3053 100644
--- a/r/tests/testthat/test-dplyr-funcs-string.R
+++ b/r/tests/testthat/test-dplyr-funcs-string.R
@@ -1178,6 +1178,15 @@ test_that("str_sub", {
       collect(),
     df
   )
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        y = str_sub(x, 1, -3),
+        y2 = stringr::str_sub(x, 1, -3)
+      ) %>%
+      collect(),
+    df
+  )
 
   expect_arrow_eval_error(
     str_sub("Apache Arrow", c(1, 2), 3),

Reply via email to