Omega359 opened a new issue, #9026:
URL: https://github.com/apache/arrow-datafusion/issues/9026

   ### Describe the bug
   
   The `instr` function does not return the character position of the matched 
substring; it returns the byte index of the match plus one. The issue is in 
this code in `string_expressions.rs`:
   ```
   .map(|(string, substr)| match (string, substr) {
                       (Some(string), Some(substr)) => string
                           .find(substr)
                           .map_or(Some(0), |index| Some((index + 1) as i32)),
                       _ => None,
                   })
   ```
   See the documentation for Rust's `str::find`, which returns the byte index of the first match.
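
A minimal standalone sketch of the behaviour, independent of DataFusion. The function name `instr_bytes` is hypothetical; it only mirrors the logic of the quoted closure for plain `&str` inputs, returning 0 when there is no match.

```rust
/// Hypothetical helper mirroring the quoted closure: 1-based position
/// derived from the byte offset that `str::find` returns, 0 if not found.
fn instr_bytes(string: &str, substr: &str) -> i32 {
    // `find` yields a byte offset, so multi-byte UTF-8 characters before
    // the match inflate the result.
    string.find(substr).map_or(0, |index| (index + 1) as i32)
}
```

Because `find` counts bytes, each multi-byte character before the match adds more than one to the returned position.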
   
   From the end user's perspective, the expected result is not the byte index 
but rather the index of the first matching Unicode grapheme in the string.
   
   ### To Reproduce
   
   ❯ docker run -it -v /tmp:/data datafusion-cli
   DataFusion CLI v35.0.0
   ❯ select instr('实现这个函数', '函')
   ;
   +----------------------------------------+
   | instr(Utf8("实现这个函数"),Utf8("函")) |
   +----------------------------------------+
   | 13                                     |
   +----------------------------------------+
   1 row in set. Query took 0.002 seconds.
   
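   The correct position is 5. The query returns 13 because each of the four preceding characters occupies 3 bytes in UTF-8, giving byte offset 12, plus 1.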
   
   ### Expected behavior
   
   The returned position should be the position of the first matching Unicode 
grapheme in the string, not the first matching byte index.
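
One possible way to get this result, sketched under assumptions: the helper name `instr_chars` is hypothetical and is not DataFusion's API. It counts Unicode scalar values (`char`s) rather than true grapheme clusters, which is an approximation; full grapheme segmentation would need an external crate and is not shown here.

```rust
/// Hypothetical helper (not DataFusion's API): 1-based character position
/// of `substr` in `string`, or 0 when there is no match.
fn instr_chars(string: &str, substr: &str) -> i32 {
    string
        .find(substr)
        // `find` yields a byte offset that always lies on a char boundary,
        // so slicing up to it is safe; count the chars before the match.
        .map_or(0, |byte_index| {
            string[..byte_index].chars().count() as i32 + 1
        })
}
```

With this approach, `instr_chars("实现这个函数", "函")` evaluates to 5, matching the position expected in the reproduction above.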
   
   ### Additional context
   
   _No response_

