thisisnic commented on a change in pull request #10624:
URL: https://github.com/apache/arrow/pull/10624#discussion_r670254295
##########
File path: r/R/dplyr-functions.R
##########
@@ -280,6 +284,81 @@ nse_funcs$str_trim <- function(string, side = c("both",
"left", "right")) {
Expression$create(trim_fun, string)
}
+nse_funcs$substr <- function(string, start, stop) {
+ assert_that(
+ length(start) == 1,
+ msg = "`start` must be length 1 - other lengths are not supported in Arrow"
+ )
+ assert_that(
+ length(stop) == 1,
+ msg = "`stop` must be length 1 - other lengths are not supported in Arrow"
+ )
+
+ if (start <= 0) {
+ start <- 1
+ }
+
+ if (stop < start) {
+ stop <- 0
+ }
+
+ Expression$create(
+ "utf8_slice_codeunits",
+ string,
+ options = list(start = start - 1L, stop = stop)
+ )
+}
+
+nse_funcs$substring <- function(text, first, last = 1000000L) {
+ assert_that(
+ length(first) == 1,
+ msg = "`first` must be length 1 - other lengths are not supported in Arrow"
+ )
+ assert_that(
+ length(last) == 1,
+ msg = "`last` must be length 1 - other lengths are not supported in Arrow"
+ )
+
+ if (first <= 0) {
+ first <- 1
+ }
+
+ if (last < first) {
+ last <- 0
+ }
+
+ Expression$create(
+ "utf8_slice_codeunits",
+ text,
+ options = list(start = first - 1L, stop = last)
+ )
+}
+
+nse_funcs$str_sub <- function(string, start = 1L, end = -1L) {
+ assert_that(
+ length(start) == 1,
+ msg = "`start` must be length 1 - other lengths are not supported in Arrow"
+ )
+ assert_that(
+ length(end) == 1,
+ msg = "`end` must be length 1 - other lengths are not supported in Arrow"
+ )
+
+ if (start == 0) start <- 1
+
+ if (end == -1) end <- .Machine$integer.max
+
+ if (end < start) end <- 0
+
+ if (start > 0) start <- start - 1L
Review comment:
The subtraction of 1 is done in this code block in this function
because it's conditional on `start` being greater than 0.
The other versions don't allow negative values to count backwards from the
end, so while `start <= 0` isn't valid in the others, here it is.
We only subtract 1 from `start` when `start` is > 0 because:
* we normally need to subtract 1 from `start` because C++ is 0-based and R
is 1-based, so they're counting from different points
* we don't need to subtract 1 from `end` as R counts inclusively (i.e. the
returned string includes the item at position `end`) whereas C++ counts
exclusively (i.e. the returned string includes *up to* but not including the
item at position `end`), which effectively cancels out the difference due to
indexing
* `str_sub` treats a `start` value of 0 or 1 as the same thing, so in both
cases a `start` value of 0 is passed to `utf8_slice_codeunits`
* a `start` value < 0 is valid as both `str_sub` and `utf8_slice_codeunits`
can count backwards from the end, with -1 being the final character in the
string, -2 being the second to last character, etc.
I'll add some of this to the code in the form of a comment.
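
To make the index translation concrete, here is a minimal sketch in plain R. The helper `str_sub_bounds()` is hypothetical (it is not part of the PR or of arrow); it only copies the four conditionals from the diff above and returns the values that would then be handed on as `start` and `stop`. The base R check at the end assumes a 0-based, end-exclusive slice, as described in the bullets.

```r
# Hypothetical helper, not part of arrow: mirrors the index handling shown
# in the diff and returns the translated bounds instead of an Expression.
str_sub_bounds <- function(start = 1L, end = -1L) {
  if (start == 0) start <- 1L                 # 0 and 1 are treated alike
  if (end == -1) end <- .Machine$integer.max  # -1 means "to the end"
  if (end < start) end <- 0L
  if (start > 0) start <- start - 1L          # 1-based -> 0-based, positives only
  list(start = start, stop = end)
}

str_sub_bounds(1, 3)    # start = 0, stop = 3
str_sub_bounds(0, 3)    # start = 0, stop = 3 (same as start = 1)
str_sub_bounds(-3, -1)  # start = -3 (left as is), stop = .Machine$integer.max

# Inclusive vs. exclusive `end` cancels out: the 0-based half-open range
# [start, stop) covers the 1-based inclusive positions (start + 1):stop.
chars <- strsplit("abcdef", "")[[1]]
b <- str_sub_bounds(2, 4)
sliced <- paste(chars[(b$start + 1):b$stop], collapse = "")
sliced                  # "bcd"
substr("abcdef", 2, 4)  # "bcd"
```

Only `start` is shifted; `end` passes through unchanged because the switch from inclusive to exclusive counting already absorbs the off-by-one.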