neilconway opened a new pull request, #20278:
URL: https://github.com/apache/datafusion/pull/20278
The previous implementation incurred the overhead of Unicode machinery, even
for the common case that both the input string and the fill string consist
only of ASCII characters. For the ASCII-only case, we can assume that the
length in bytes equals the length in characters, and avoid expensive
grapheme-based segmentation. This follows similar optimizations applied
elsewhere in the codebase.
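
As a rough illustration of the idea (not the code in this PR), the sketch below shows an ASCII fast path for a left-padding helper. The name `lpad_sketch` and its signature are hypothetical. The fallback path counts `char`s instead of grapheme clusters, only because the standard library has no grapheme segmentation; real code would presumably use a grapheme-aware crate there.

```rust
/// Hypothetical helper: left-pad `input` to `target_len` characters using
/// `fill`, truncating `input` if it is already longer than `target_len`.
fn lpad_sketch(input: &str, target_len: usize, fill: &str) -> String {
    if input.is_ascii() && fill.is_ascii() {
        // ASCII fast path: every character is one byte, so byte lengths and
        // byte offsets can stand in for character counts and positions.
        let len = input.len();
        if target_len <= len {
            return input[..target_len].to_string();
        }
        if fill.is_empty() {
            return input.to_string();
        }
        let mut out = String::with_capacity(target_len);
        let pad = target_len - len;
        // Repeat the fill bytes until `pad` characters have been written.
        out.extend(fill.as_bytes().iter().cycle().take(pad).map(|&b| b as char));
        out.push_str(input);
        return out;
    }

    // Slow path (simplified assumption): count Unicode scalar values.
    let chars: Vec<char> = input.chars().collect();
    if target_len <= chars.len() {
        return chars[..target_len].iter().collect();
    }
    if fill.is_empty() {
        return input.to_string();
    }
    let pad = target_len - chars.len();
    let mut out: String = fill.chars().cycle().take(pad).collect();
    out.push_str(input);
    out
}
```

The `is_ascii()` checks scan both strings once, which is the likely source of the small extra cost on non-ASCII input.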
Benchmarks indicate this is a significant performance win for ASCII-only
input (4x-10x faster), at the cost of a mild regression for Unicode input
(2-5% slower).
Along the way:
* Replace a few instances of `write_str(str)? + append_value("")` with
`append_value(str)`, which saves a few cycles
* Add a missing test case for truncating the input string
* Add benchmarks for Unicode input
## Which issue does this PR close?
- Closes #20277.
## Are these changes tested?
Covered by existing tests. Added new benchmarks for Unicode inputs.
## Are there any user-facing changes?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]