coussens commented on issue #43960:
URL: https://github.com/apache/arrow/issues/43960#issuecomment-2344576504
@thisisnic totally agree that the snippet you point out is the culprit for
the behavior in Example 1 above. I think the conditional logic could instead be
changed to the following in order to avoid returning empty strings for negative
values of `end`:
```
if(end < start & end > 0) {
end <- 0
start <- 0
}
```
Upon looking at the source code (sorry I hadn't taken the time to examine
before), I think more problems in addition to this (but please correct me if
I'm wrong!).
I believe the `end` argument in the C++ function `utf8_slice_codeunits` is
_exclusive_, so won't include `my_string[end]` in the returned substring. For
positive `end` values, this poses no issue, since in C++ is 0-based and R is
1-based, cancelling each other out. Note that for positive `start` values, the
[below snippet in the source
code](https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs-string.R#L573)
addresses this:
```
# subtract 1 from `start` because C++ is 0-based and R is 1-based
# str_sub treats a `start` value of 0 or 1 as the same thing so don't
subtract 1 when `start` == 0
# when `start` < 0, both str_sub and utf8_slice_codeunits count
backwards from the end
if (start > 0) {
start <- start - 1L
}
```
For negative `start` values, this isn't necessary, as noted in source
comments above. But that leaves the issue of negative `end` values unsolved.
This explains why in my Example 2 above `arrow::str_sub` fails to include
my_string[-2] (i.e. only returns 'c' instead of the correct 'cd').
However, for just the [default] case when `end == -1`, this problem is
incidentally fixed by [this
code](https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs-string.R#L567):
```
# In stringr::str_sub, an `end` value of -1 means the end of the string,
so
# set it to the maximum integer to match this behavior
if (end == -1) {
end <- .Machine$integer.max
}
```
which explains why `arrow::str_sub` works fine (as in my Example 3 above)
when `end == -1`, but not when `end < -1` (which confused me at first).
I'm happy to create a MR that should fix this, with the warning that while
I've been working with R for quite a while, I'm a total newb when it comes to
the collaborative development side of things (@thisisnic thanks so much for
your https://github.com/forwards/first-contributions tutorial!).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]