coussens commented on issue #43960:
URL: https://github.com/apache/arrow/issues/43960#issuecomment-2344576504

   @thisisnic totally agree that the snippet you point out is the culprit for 
the behavior in Example 1 above. I think the conditional logic could instead be 
changed to the following in order to avoid returning empty strings for negative 
values of `end`:
   
   ```
       if(end < start & end > 0) {
         end <- 0
         start <- 0
       }
   ```
   
   Upon looking at the source code (sorry I hadn't taken the time to examine 
before), I think more problems in addition to this (but please correct me if 
I'm wrong!). 
   
   I believe the `end` argument in the C++ function `utf8_slice_codeunits` is 
_exclusive_, so won't include `my_string[end]` in the returned substring. For 
positive `end` values, this poses no issue, since in C++ is 0-based and R is 
1-based, cancelling each other out. Note that for positive `start` values, the 
[below snippet in the source 
code](https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs-string.R#L573) 
addresses this:
   
   ```
       # subtract 1 from `start` because C++ is 0-based and R is 1-based
       # str_sub treats a `start` value of 0 or 1 as the same thing so don't 
subtract 1 when `start` == 0
       # when `start` < 0, both str_sub and utf8_slice_codeunits count 
backwards from the end
       if (start > 0) {
         start <- start - 1L
       }
   
   ```
   For negative `start` values, this isn't necessary, as noted in source 
comments above. But that leaves the issue of negative `end` values unsolved. 
This explains why in my Example 2 above `arrow::str_sub` fails to include 
my_string[-2] (i.e. only returns 'c' instead of the correct 'cd'). 
   
   However, for just the [default] case when `end == -1`, this problem is 
incidentally fixed by [this 
code](https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs-string.R#L567):
   
   ```
       # In stringr::str_sub, an `end` value of -1 means the end of the string, 
so
       # set it to the maximum integer to match this behavior
       if (end == -1) {
         end <- .Machine$integer.max
       }
   ```
   which explains why `arrow::str_sub` works fine (as in my Example 3 above) 
when `end == -1`, but not when `end < -1` (which confused me at first).
   
   I'm happy to create a MR that should fix this, with the warning that while 
I've been working with R for quite a while, I'm a total newb when it comes to 
the collaborative development side of things (@thisisnic thanks so much for 
your https://github.com/forwards/first-contributions tutorial!). 
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to