Mike Seddon created ARROW-11339:
-----------------------------------
Summary: [Rust][DataFusion] length kernel does not correctly
calculate character length
Key: ARROW-11339
URL: https://issues.apache.org/jira/browse/ARROW-11339
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Reporter: Mike Seddon
Assignee: Mike Seddon
The current kernel only works for single-byte (ASCII) characters, as it appears
to assume that 1 byte = 1 character. This is very fast but is not a safe
assumption, given that Arrow strings are UTF-8.
A simple failing case is the example from the Postgres documentation, for which
the current `length` implementation returns 5 instead of 4, because 'é' occupies
two bytes in UTF-8:
`char_length('josé') → 4`
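To make the discrepancy concrete, here is a minimal sketch using only the Rust standard library. The helper names `byte_length` and `char_length` are hypothetical and are not the DataFusion kernel API. Note that `chars().count()` counts Unicode scalar values, which is not the same as grapheme clusters: a decomposed 'é' (an 'e' followed by a combining accent) still counts as two.

```rust
/// Hypothetical helper: the byte length, which is what a
/// "1 byte = 1 character" implementation effectively returns.
fn byte_length(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper: the number of Unicode scalar values (code points).
fn char_length(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Precomposed form: 'é' is the single code point U+00E9 (two bytes in UTF-8).
    let precomposed = "jos\u{e9}";
    println!(
        "precomposed: bytes = {}, chars = {}",
        byte_length(precomposed),
        char_length(precomposed)
    );

    // Decomposed form: 'e' followed by the combining acute accent U+0301.
    // Counting code points over-counts here; grapheme segmentation would
    // treat "e\u{301}" as one user-perceived character.
    let decomposed = "jose\u{301}";
    println!(
        "decomposed:  bytes = {}, chars = {}",
        byte_length(decomposed),
        char_length(decomposed)
    );
}
```

For the precomposed string the byte length is 5 while the code point count is 4, which matches the Postgres result above. The decomposed string is where code point counting and grapheme counting diverge.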
The correct method seems to be grapheme iteration via
https://docs.rs/unicode-segmentation/1.2.1/unicode_segmentation/struct.Graphemes.html
which I can implement as part of my work here:
https://github.com/apache/arrow/pull/9243 and then remove `length` from the kernel.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)