[
https://issues.apache.org/jira/browse/ARROW-13259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374907#comment-17374907
]
Joris Van den Bossche commented on ARROW-13259:
-----------------------------------------------
To copy over the practical example:
{code}
In [23]: import pyarrow.compute as pc
In [24]: import sys
In [25]: string = "Apache Arrow"
In [26]: pc.utf8_slice_codeunits(string, start=-5, stop=sys.maxsize)
Out[26]: <pyarrow.StringScalar: 'Arrow'>
In [27]: pc.utf8_slice_codeunits(string, start=-5, stop=-1)
Out[27]: <pyarrow.StringScalar: 'Arro'>
{code}
So "a large integer" can be used to indicate "slice until the end" (I suppose
because you can never have a scalar string with a longer length than that
value?).
In Python this is {{sys.maxsize}}, in C++ it's
{{std::numeric_limits<int64_t>::max()}}.
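
For the R binding, the index translation could look roughly like the sketch below. It is only an illustration: {{str_sub_to_slice}} and {{str_sub}} are hypothetical helper names, not Arrow or stringr functions, and Python's built-in slicing stands in for the kernel on the assumption that both clamp an oversized stop to the string length. The point is that {{end = -1}} needs the sentinel, because {{end + 1}} would be 0.

```python
import sys


def str_sub_to_slice(start, end):
    """Hypothetical helper: map stringr-style (1-based, inclusive) indices
    to a 0-based, half-open (start, stop) pair."""
    # Positive start: shift from 1-based to 0-based.
    # Negative start: counts back from the end identically in both conventions.
    py_start = start - 1 if start > 0 else start
    if end >= 0:
        # A 1-based inclusive end equals a 0-based exclusive stop.
        py_stop = end
    elif end == -1:
        # end + 1 would be 0 (the start of the string), so use the
        # "large integer" sentinel to mean "slice until the end".
        py_stop = sys.maxsize
    else:
        # Other negative ends: add 1 because stop is exclusive.
        py_stop = end + 1
    return py_start, py_stop


def str_sub(string, start, end):
    # Assumption: Python slicing is used here as a stand-in for the kernel.
    py_start, py_stop = str_sub_to_slice(start, end)
    return string[py_start:py_stop]


# Strings of different lengths, no per-string length calculation needed.
for s in ["Apache Arrow", "Arrow", "pyarrow"]:
    print(repr(str_sub(s, -5, -1)), repr(str_sub(s, -5, -2)))
```

Because the sentinel does not depend on the input, the same (start, stop) pair works for every element of an array, which is the "unknown or different lengths" case from the issue title.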
> [C++] Enable slicing to end of string using "utf8_slice_codeunits" when
> string length unknown or different lengths
> -------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-13259
> URL: https://issues.apache.org/jira/browse/ARROW-13259
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nic Crane
> Priority: Major
>
> We're currently trying to write bindings from the C++ function
> "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour
> of R's stringr::str_sub.
> In both the R and C++ implementations, I can use negative indices to count
> back from the end of a string (shown below in R, but the latter directly
> invokes the C++ implementation):
>
> {code:java}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -2)
> [1] "Arro"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"),
> > options = list(start=-5L, stop=-1L))
> Scalar
> Arro{code}
> Note that in the C++ implementation, I have to add 1 to the stop value as the
> final value is non-inclusive.
> The problem is when I'm trying to use negative indices to refer to the final
> values in a string:
>
> {code:java}
> # stringr version
> > stringr::str_sub("Apache Arrow", -5, -1)
> [1] "Arrow"
> # C++ version
> > call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"),
> > options = list(start=-5L, stop=0L))
> Scalar
> {code}
> The result is blank as the 'stop' value 0 refers to the start of the string,
> effectively walking backwards, which isn't possible (except via the step
> argument, which I can't get working and don't think is what I want anyway).
> I've tried to get around this by writing some code that calculates the
> length of the string and supplies that to the stop argument, but it didn't
> work.
> I do have a possible workaround that involves reversing the string,
> extracting the substring using inverted values of swapped stop/start values,
> and then reversing the result, but before I go down that path, I was
> wondering if there is anything that can (and should! the answer may be a
> simple "nope!") be changed in the C++ code to make it possible to do this a
> different way?
>