coussens opened a new pull request, #44141:
URL: https://github.com/apache/arrow/pull/44141
First-time contributor here, so let me know where I can improve!
### Rationale for this change
The `str_sub` binding in arrow was not handling negative `end` values
properly. The problem was two-fold:
1. When `end` values were negative (and less than the `start` value, which
might be positive), `str_sub` would improperly return an empty string.
2. When `end` values were < -1 but the `end` position was still to the
right of the `start` position, `str_sub` failed to return the final character
in the substring, since it did not account for the fact that `end` is counted
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but
inclusively in R.
See discussion/examples at https://github.com/apache/arrow/issues/43960 for
details.
### What changes are included in this PR?
1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set
`end`= 0 when `start < end`, which meant if the user was counting backwards
from the end of the string (with a negative `end` value), an empty string would
[wrongly] be returned. It appears that the case that the previous code was
trying to address is already handled properly by the underlying C++ function
(`utf8_slice_codeunits`).
2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the
difference in between R's inclusive `end` and C++'s exclusive `end` when `end`
is negative.
3. The addition of a test (described below) to
`r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.
### Are these changes tested?
Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`,
including one which I added (see attached commit), which explicitly tests the
case where `end` is negative (-3) and less than the `start` value (1). This
also tests the case where `end` < -1 and to the right of the `start` position.
### Are there any user-facing changes?
No.
**This PR contains a "Critical Fix".** Previously:
- When `end` values were negative (and less than the `start` value, which
might be positive), `str_sub` would improperly return an empty string.
- When `end` values were < -1 but the `end` position was still to the right
of the `start` position, `str_sub` failed to return the final character in the
substring, since it did not account for the fact that `end` is counted
exclusively in the underlying C++ function (`utf8_slice_codeunits`), but
inclusively in R.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]