[ https://issues.apache.org/jira/browse/ARROW-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307436#comment-17307436 ]
Eduardo Ponce commented on ARROW-11693:
---------------------------------------
Well, thinking more about it, simply checking bytes against 0x80 also accepts
invalid sequences, because not every byte sequence in that range is valid UTF-8.
Consider invalid bytes, invalid continuation bytes, invalid sequences, etc.
utf8proc takes these cases into account when decoding:
https://github.com/JuliaStrings/utf8proc/blob/master/utf8proc.c#L124-L171
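
To make the difference concrete, here is a rough sketch of a strict count next to a naive one. It is neither utf8proc's code nor an Arrow API: `CountCodepointsStrict` and `CountCodepointsNaive` are hypothetical names, and the naive version (counting every byte that is not a continuation byte) is only one assumed reading of the simple high-bit check.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: counts every byte that is not a continuation byte
// (10xxxxxx). It never rejects anything, so malformed input is still counted.
std::size_t CountCodepointsNaive(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char c : s) count += ((c & 0xC0) != 0x80) ? 1 : 0;
  return count;
}

// Hypothetical helper: returns false if `s` is not well-formed UTF-8;
// otherwise stores the number of code points in `*out` and returns true.
bool CountCodepointsStrict(const std::string& s, std::size_t* out) {
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    const std::uint8_t b0 = static_cast<std::uint8_t>(s[i]);
    std::size_t len;
    std::uint32_t cp;
    std::uint32_t min_cp;  // smallest code point allowed for this length
    if (b0 < 0x80) { len = 1; cp = b0; min_cp = 0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte

    if (s.size() - i < len) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const std::uint8_t b = static_cast<std::uint8_t>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // invalid continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    ++count;
    i += len;
  }
  *out = count;
  return true;
}
```

For the "áéíóú" example in the issue (10 bytes), both functions report 5. For a lone lead byte such as "\xC3", the naive count still reports 1, while the strict version rejects the input.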
> [C++] Add string length kernel
> ------------------------------
>
> Key: ARROW-11693
> URL: https://issues.apache.org/jira/browse/ARROW-11693
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Neal Richardson
> Assignee: Eduardo Ponce
> Priority: Major
> Fix For: 4.0.0
>
>
> We have "binary_length" but that doesn't handle UTF-8 the way we need for
> this. Example (from R):
> {code}
> > string <- "áéíóú"
> > nchar(string)
> [1] 5
> > arrow:::call_function("binary_length", Scalar$create(string))
> Scalar
> 10
> {code}
> cc [~maartenbreddels] [~apitrou] [~jorisvandenbossche]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)