[GitHub] [arrow-datafusion] Dandandan opened a new issue, #3049: Simplify / speed up implementation of character_length to unicode points

GitBox Fri, 05 Aug 2022 14:59:26 -0700


Dandandan opened a new issue, #3049:
URL: https://github.com/apache/arrow-datafusion/issues/3049


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   It looks like PostgreSQL and Spark use a different (and probably much faster) 
implementation for calculating string length, one that counts Unicode code 
points rather than grapheme clusters.
   
   E.g. see this example:
   
   `select length('ä')`

   | Postgres | Spark | DataFusion |
   |----------|-------|------------|
   | 2        | 2     | 1          |
   
   **Describe the solution you'd like**
   `.chars().count()`, which counts Unicode code points instead of grapheme 
clusters, is probably a faster solution.
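
A minimal sketch of the code-point approach, using only the Rust standard library. The helper names `char_length` and `byte_length` are hypothetical and are not DataFusion's actual functions; the input is written with explicit escapes so the decomposed form (`a` followed by a combining diaeresis) is unambiguous.

```rust
/// Hypothetical helper: counts Unicode scalar values (code points),
/// not grapheme clusters. `str::chars` decodes UTF-8 lazily, so this
/// is a single pass over the bytes with no segmentation tables.
fn char_length(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical helper: length in UTF-8 bytes, shown for comparison.
fn byte_length(s: &str) -> usize {
    s.len()
}
```

For the decomposed string `"a\u{0308}"`, `char_length` gives 2 and `byte_length` gives 3, while a grapheme-based count treats the pair as one user-perceived character. The precomposed `"\u{00E4}"` is a single code point, so the result depends on how the input is normalized.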
   
   **Describe alternatives you've considered**
   Accepting that our own implementation is slower but "superior" to other 
solutions.
   
   **Additional context**
   This came up as being quite slow when profiling this benchmark: 
https://github.com/DataPsycho/data-pipelines-in-rust/tree/main/amazon_review_pipeline
 
   We're using the `grapheme` option in other places as well, maybe here we 
could use the faster 


