HaoYang670 opened a new issue #1478:
URL: https://github.com/apache/arrow-rs/issues/1478


   **Describe the bug**
   The `substring` kernel only works correctly on chars that are encoded as 
1 byte in UTF-8. If the string contains a char that takes more than 1 byte, 
the kernel can panic.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   Take the string `"E=mc²"` with start index = `-1` and length = `None`. 
   The expected result is `"²"`.
   However, I got:
   ```
   thread 'compute::kernels::substring::tests::without_nulls_string' panicked 
at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30
   ```
   
   The reason is that the char `²` is encoded as `0xC2 0xB2` in UTF-8. 
When we try to take the last char of the string, what we actually get is the 
byte sequence `[0xB2]`, which is not valid UTF-8.
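
   The byte-level behaviour can be checked with the standard library alone. The sketch below does not call the arrow-rs kernel; `last_bytes` is a hypothetical helper that only mimics "take the last `n` bytes", on the assumption that this is what a byte-based `start = -1` amounts to.

```rust
/// Hypothetical helper (not part of arrow-rs): returns the last `n` bytes
/// of `s` as raw bytes, with no regard for UTF-8 character boundaries.
fn last_bytes(s: &str, n: usize) -> &[u8] {
    let bytes = s.as_bytes();
    // saturating_sub keeps the slice in range when n exceeds the length.
    &bytes[bytes.len().saturating_sub(n)..]
}

/// Returns true if the last `n` bytes of `s` form valid UTF-8 on their own.
fn tail_is_valid_utf8(s: &str, n: usize) -> bool {
    std::str::from_utf8(last_bytes(s, n)).is_ok()
}
```

   For `"E=mc²"` (5 chars, 6 bytes), `last_bytes(s, 1)` yields `[0xB2]` and `tail_is_valid_utf8(s, 1)` is `false`, while taking 2 bytes yields the complete char `²`.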
   
   **Expected behavior**
   I think there are three ways to fix the bug:
   1. (easy) Update the docs of the `substring` function to explain that we 
only support 1-byte UTF-8 chars, and that `start` and `length` are counted 
in bytes. 
   2. (a little difficult) Check in the `substring` function that the string 
array contains only 1-byte UTF-8 chars (the highest-order bit of every byte is 0).
   3. (difficult, and the API will be changed) Slice based on characters, 
not bytes.
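
   To make options 2 and 3 concrete, here is a minimal sketch on a single `&str` rather than on a string array. Both function names are hypothetical and not part of arrow-rs. The handling of a negative `start` (counting back from the end) and of out-of-range values (clamping) is an assumption about the desired semantics.

```rust
/// Option 2 sketch: true if every char in `s` is encoded as one byte,
/// i.e. the highest-order bit of every byte is 0.
fn is_single_byte_only(s: &str) -> bool {
    s.is_ascii()
}

/// Option 3 sketch: slices `s` by characters instead of bytes.
/// A negative `start` counts back from the end of the string;
/// `length = None` means "up to the end of the string".
fn substring_by_char(s: &str, start: i64, length: Option<u64>) -> &str {
    let char_count = s.chars().count() as i64;
    // Clamp the starting character index into [0, char_count].
    let start_char = if start >= 0 {
        start.min(char_count)
    } else {
        (char_count + start).max(0)
    };
    let start_char = start_char as usize;

    // Translate the character index into a byte offset that is
    // guaranteed to sit on a char boundary.
    let byte_start = s
        .char_indices()
        .nth(start_char)
        .map_or(s.len(), |(i, _)| i);
    let rest = &s[byte_start..];

    match length {
        None => rest,
        Some(len) => {
            let len = usize::try_from(len).unwrap_or(usize::MAX);
            // Byte offset just past the `len`-th char, or the end of `rest`.
            let byte_end = rest
                .char_indices()
                .nth(len)
                .map_or(rest.len(), |(i, _)| i);
            &rest[..byte_end]
        }
    }
}
```

   With this sketch, `substring_by_char("E=mc²", -1, None)` returns `"²"`. The trade-off is cost: finding a char boundary requires walking the string, whereas byte offsets can be computed directly.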
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
