coussens opened a new pull request, #44141:
URL: https://github.com/apache/arrow/pull/44141

   First-time contributor here, so let me know where I can improve!
   
   ### Rationale for this change
   
   The `str_sub` binding in arrow was not handling negative `end` values 
properly. The problem was two-fold: 
    1.  When `end` values were negative (and less than the `start` value, which 
might be positive), `str_sub` would improperly return an empty string. 
   2.  When `end` values were < -1 but the `end` position was still to the 
right of the `start` position, `str_sub` failed to return the final character 
in the substring, since it did not account for the fact that `end` is counted 
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but 
inclusively in R. 
   
   See discussion/examples at https://github.com/apache/arrow/issues/43960 for 
details.
   
   ### What changes are included in this PR?
   1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set 
`end`= 0 when `start < end`, which meant if the user was counting backwards 
from the end of the string (with a negative `end` value), an empty string would 
[wrongly] be returned. It appears that the case that the previous code was 
trying to address is already handled properly by the underlying C++ function 
(`utf8_slice_codeunits`).
   2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the 
difference in between R's inclusive `end` and C++'s exclusive `end` when `end` 
is negative.
   3. The addition of a test (described below) to 
`r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.
   
   ### Are these changes tested?
   
   Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, 
including one which I added (see attached commit), which explicitly tests the 
case where `end` is negative (-3) and less than the `start` value (1). This 
also tests the case where `end`  < -1 and to the right of the `start` position.
   
   ### Are there any user-facing changes?
   
   No.
   
   
   **This PR contains a "Critical Fix".** Previously:
   - When `end` values were negative (and less than the `start` value, which 
might be positive), `str_sub` would improperly return an empty string. 
   - When `end` values were < -1 but the `end` position was still to the right 
of the `start` position, `str_sub` failed to return the final character in the 
substring, since it did not account for the fact that `end` is counted 
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but 
inclusively in R. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to