HaoYang670 opened a new issue #1478: URL: https://github.com/apache/arrow-rs/issues/1478
**Describe the bug**
The `substring` kernel only works on strings in which every char is encoded as 1 byte in UTF-8. If the string contains a char that takes more than 1 byte, it can panic.

**To Reproduce**
Given the string `"E=mc²"`, start index `-1`, and length `None`, the expected result is `"²"`. However, I got:

```
thread 'compute::kernels::substring::tests::without_nulls_string' panicked at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30
```

The reason is that the char `²` is encoded as `0xC2 0xB2` in UTF-8. When we try to take the last char of the string, what we actually get is the byte sequence `[0xB2]`, which is not valid UTF-8.

**Expected behavior**
I think there are three ways to fix the bug:

1. (easy) Update the doc of the `substring` function to explain that only 1-byte UTF-8 chars are supported, and that `start` and `length` are counted in bytes.
2. (a little difficult) In the `substring` function, check that the string array contains only 1-byte UTF-8 chars (the highest-order bit of every byte is 0).
3. (difficult, and the API will change) Slice based on characters, not bytes.
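
To illustrate the byte-versus-char mismatch and roughly what the third option could look like, here is a minimal standalone sketch using only the Rust standard library. The helper `substring_by_chars` is hypothetical and is not part of arrow-rs; it works on a single `&str` rather than on a string array, and it assumes that a negative `start` counts back from the end of the string.

```rust
/// Hypothetical helper (not an arrow-rs API): take a substring counted in
/// chars (Unicode scalar values) rather than bytes. A negative `start`
/// is assumed to count back from the end of the string.
fn substring_by_chars(s: &str, start: i64, length: Option<usize>) -> &str {
    let char_count = s.chars().count() as i64;
    // Resolve a possibly negative start into a char index in [0, char_count].
    let start_char = if start < 0 {
        (char_count + start).max(0)
    } else {
        start.min(char_count)
    };
    let start_char = start_char as usize;

    // Map a char index to its byte offset; indices past the end map to s.len().
    let byte_offset = |char_idx: usize| {
        s.char_indices()
            .nth(char_idx)
            .map(|(byte_idx, _)| byte_idx)
            .unwrap_or(s.len())
    };

    let begin = byte_offset(start_char);
    let end = match length {
        Some(len) => byte_offset(start_char.saturating_add(len)),
        None => s.len(),
    };
    // Both offsets come from `char_indices` or `s.len()`, so they are char boundaries.
    &s[begin..end]
}

fn main() {
    let s = "E=mc²";
    // 5 chars but 6 bytes, because `²` is encoded as 0xC2 0xB2.
    assert_eq!(s.chars().count(), 5);
    assert_eq!(s.len(), 6);
    // Byte offset 5 (the last byte) falls inside `²`, so slicing there is invalid.
    assert!(!s.is_char_boundary(5));
    assert_eq!(s.get(5..), None);
    // Counting in chars yields the result expected in the report.
    assert_eq!(substring_by_chars(s, -1, None), "²");
    println!("{}", substring_by_chars(s, -1, None));
}
```

Walking `char_indices` makes each lookup linear in the string length, which is part of why a char-based kernel would be costlier than byte slicing. For the second option, `str::is_ascii` performs the equivalent "highest-order bit is 0" check on a single string.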
