alamb commented on issue #1478: URL: https://github.com/apache/arrow-rs/issues/1478#issuecomment-1076768993
Hi @HaoYang670 -- I suggest:

1. Leave the implementation in terms of bytes.
2. Clarify the documentation (your suggestion 1).
3. Verify that the `StringArray` created by `substring` contains only valid UTF-8 data, and return a more specific error message if it does not.

I am not sure whether the code already does 3.

The challenge with proper Unicode support, from my perspective, is that it will likely be slower and require a new dependency (to identify the Unicode graphemes).

There is a Unicode-aware implementation of substr in the datafusion repo, I believe contributed by @ovr:
https://github.com/apache/arrow-datafusion/blob/eb5a18a427bb718bffbf477c8fdf0230bb0a6242/datafusion-physical-expr/src/unicode_expressions.rs#L413-L441

Another possibility is to add an optional feature flag to arrow-rs for "unicode" string support and base the behavior on that flag, but that sounds a little overcomplicated.
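
To illustrate the trade-off between a byte-based substring with a UTF-8 check and a character-based one, here is a minimal standalone sketch using only the Rust standard library. The helper names `byte_substring` and `char_substring` are hypothetical and are not the arrow-rs or datafusion API. The second helper works on Unicode code points, not graphemes. Grapheme segmentation would need an extra dependency, which is the cost mentioned above.

```rust
use std::str;

/// Hypothetical helper (not the arrow-rs API): take up to `len` bytes of `s`
/// starting at byte offset `start`, clamping the range to the end of the string.
/// Returns a descriptive error if the byte range splits a multi-byte
/// UTF-8 character, instead of producing invalid string data.
fn byte_substring(s: &str, start: usize, len: usize) -> Result<&str, String> {
    let bytes = s.as_bytes();
    let start = start.min(bytes.len());
    let end = start.saturating_add(len).min(bytes.len());
    // `from_utf8` validates the slice, so a cut through the middle of a
    // multi-byte character is reported rather than silently accepted.
    str::from_utf8(&bytes[start..end]).map_err(|e| {
        format!(
            "byte range {}..{} of {:?} is not valid UTF-8: {}",
            start, end, s, e
        )
    })
}

/// Hypothetical code-point based alternative: `start` and `len` count
/// `char`s (Unicode scalar values), so the result is always valid UTF-8,
/// at the cost of walking the string instead of slicing it directly.
fn char_substring(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start).take(len).collect()
}
```

With these definitions, `byte_substring("héllo", 0, 2)` returns an error because byte 2 falls inside the two-byte `é`, while `char_substring("héllo", 1, 3)` yields `"éll"`.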
