Dandandan opened a new issue, #3049:
URL: https://github.com/apache/arrow-datafusion/issues/3049
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
It looks like PostgreSQL and Spark use a different (and probably much faster)
implementation of string length, one that counts Unicode code points rather
than grapheme clusters.
E.g. see this example:
`select length('ä')`
| Postgres | Spark | DataFusion |
|----------|-------|------------|
| 2 | 2 | 1 |
**Describe the solution you'd like**
`.chars().count()`, which counts code points instead of grapheme clusters, is
probably a faster solution.
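
A minimal sketch of what code-point counting looks like using only the standard library. The helper names `char_length` and `byte_length` are hypothetical and are not DataFusion's actual functions. The sketch also assumes the `'ä'` in the example above is in decomposed form (`a` followed by U+0308), which is what would give 2 in PostgreSQL and Spark. Grapheme counting is left out because it needs an external segmentation crate.

```rust
/// Hypothetical helper: string length as the number of Unicode scalar
/// values (what `.chars().count()` returns). This is a single linear
/// pass over the UTF-8 bytes with no grapheme segmentation.
fn char_length(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical helper: string length in UTF-8 bytes, for comparison.
fn byte_length(s: &str) -> usize {
    s.len()
}

/// Decomposed form: 'a' followed by U+0308 COMBINING DIAERESIS.
/// One grapheme cluster, two code points, three UTF-8 bytes.
const DECOMPOSED: &str = "a\u{308}";

/// Precomposed form: U+00E4. One grapheme cluster, one code point,
/// two UTF-8 bytes.
const PRECOMPOSED: &str = "\u{e4}";
```

With these definitions, `char_length(DECOMPOSED)` is 2 and `char_length(PRECOMPOSED)` is 1, while a grapheme-based count gives 1 for both. The result of a code-point count therefore depends on the normalization form of the input.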
**Describe alternatives you've considered**
Accept that our own implementation is slower but "superior" to the other
solutions.
**Additional context**
This came up as being quite slow when profiling this benchmark:
https://github.com/DataPsycho/data-pipelines-in-rust/tree/main/amazon_review_pipeline
We're using the `grapheme` option in other places as well; maybe here we
could use the faster code-point count instead.