[GitHub] [arrow-rs] alamb commented on issue #1478: The `substring` kernel panics when chars > U+0x007F

GitBox Wed, 23 Mar 2022 13:04:58 -0700


alamb commented on issue #1478:
URL: https://github.com/apache/arrow-rs/issues/1478#issuecomment-1076768993



   Hi @HaoYang670 -- I suggest:
   1. Leave the implementation in terms of bytes.
   2. Clarify the documentation (your suggestion 1).
   3. Verify that the `StringArray` created by `substring` contains only valid 
utf8 data, and return a more specific error message if it does not. 
   
   I am not sure if the code already does 3 or not. 
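
   To make point 3 concrete, here is a minimal sketch using only the standard 
library. It works on a single `&str` rather than on a `StringArray`, and the 
name `substring_bytes_checked` and the error text are made up for illustration; 
it is not the existing kernel code.

```rust
use std::str;

/// Hypothetical helper (not part of arrow-rs): take up to `length` bytes of
/// `value` starting at byte offset `start`, and report an error instead of
/// panicking when the byte range does not form valid UTF-8.
fn substring_bytes_checked(value: &str, start: usize, length: usize) -> Result<&str, String> {
    let bytes = value.as_bytes();
    // Clamp both ends so an out-of-range request yields an empty slice.
    let start = start.min(bytes.len());
    let end = start.saturating_add(length).min(bytes.len());
    // `from_utf8` fails if the range cuts through a multi-byte character.
    str::from_utf8(&bytes[start..end]).map_err(|e| {
        format!(
            "substring byte range {}..{} of {:?} is not valid UTF-8: {}",
            start, end, value, e
        )
    })
}
```

   With this sketch, `substring_bytes_checked("héllo", 1, 2)` succeeds because 
bytes 1..3 are exactly the two bytes of `é`, while a start offset of 2 lands in 
the middle of `é` and produces the error.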
   
   The challenge with proper unicode support, from my perspective, is that it 
will likely be slower and require a new dependency (to identify the unicode 
graphemes). There is a unicode-aware implementation of `substr` in the 
datafusion repo, contributed by @ovr I believe:
   
   
https://github.com/apache/arrow-datafusion/blob/eb5a18a427bb718bffbf477c8fdf0230bb0a6242/datafusion-physical-expr/src/unicode_expressions.rs#L413-L441
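
   For comparison, a rough sketch of a character-based substring using only the 
standard library. It counts `char`s (Unicode scalar values), not graphemes, so 
it is only an approximation of "proper" unicode support; splitting on grapheme 
clusters would need an extra crate (for example `unicode-segmentation`). The 
name `substring_chars` is hypothetical and this is not the linked datafusion 
code.

```rust
/// Hypothetical helper: substring where `start` and `length` count `char`s
/// (Unicode scalar values) rather than bytes.
fn substring_chars(value: &str, start: usize, length: usize) -> String {
    // Decoding happens one `char` at a time, so the cost grows with `start`,
    // unlike a byte-offset slice.
    value.chars().skip(start).take(length).collect()
}
```

   This never panics on multi-byte input, but it can still split a grapheme: 
for `"e\u{301}x"` (an `e` followed by a combining acute accent) a length of 1 
returns just `"e"` without the accent.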
   
   Another possibility is to add an optional feature flag to arrow-rs for 
"unicode" string support and base the behavior on that flag. But that sounds a 
little overcomplicated.
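
   A sketch of what the feature-flag variant could look like, assuming a Cargo 
feature named `unicode` declared under `[features]` (the feature name, the 
function name, and the `Option` return type are all assumptions for 
illustration):

```rust
/// With the hypothetical `unicode` feature enabled, offsets count `char`s.
#[cfg(feature = "unicode")]
fn substring(value: &str, start: usize, length: usize) -> Option<String> {
    Some(value.chars().skip(start).take(length).collect())
}

/// Without the feature, offsets count bytes. `str::get` returns `None` when
/// the range is out of bounds or does not fall on character boundaries.
#[cfg(not(feature = "unicode"))]
fn substring(value: &str, start: usize, length: usize) -> Option<String> {
    let end = start.checked_add(length)?;
    value.get(start..end).map(str::to_owned)
}
```

   The two variants agree on ASCII input but give different results for the 
same arguments on multi-byte input, which is part of why this option looks 
complicated.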
   


