neilconway opened a new pull request, #21405:
URL: https://github.com/apache/datafusion/pull/21405
## Which issue does this PR close?
- Closes #21060.
## Rationale for this change
`lpad`, `rpad`, and `translate` use grapheme segmentation. This is
inconsistent with how these functions behave in Postgres and DuckDB, as well as
the SQL standard -- segmentation based on Unicode codepoints is used instead.
It also happens that grapheme-based segmentation is significantly more
expensive than codepoint-based segmentation.
In the case of `lpad` and `rpad`, graphemes and codepoints were used
inconsistently: the input string was measured in code points but the fill
string was measured in graphemes.
Benchmarks (M4 Max):
lpad size=1024:
- lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
- lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
- lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
- lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
- lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
- lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs, +1.0%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs → 10.1
µs, -0.1%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs →
10.9 µs, -75.7%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs →
11.7 µs, -92.3%
lpad size=4096:
- lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
- lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
- lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
- lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
- lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
- lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs,
+2.4%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs → 40.4
µs, +0.6%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs →
42.9 µs, -76.0%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs →
46.2 µs, -92.3%
rpad size=1024:
- rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
- rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
- rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
- rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
- rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
- rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs,
-3.9%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs → 12.6
µs, +2.8%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs →
12.4 µs, -73.1%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs →
11.6 µs, -92.4%
rpad size=4096:
- rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
- rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
- rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
- rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
- rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
- rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs,
-4.0%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs → 58.5
µs, +3.6%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs →
48.1 µs, -73.9%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs →
45.6 µs, -92.5%
translate size=1024:
- array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
- scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
- array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
- scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
- array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
- scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
- array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
- scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%
translate size=4096:
- array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
- scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
- array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
- scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
- array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
- scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
- array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
- scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%
## What changes are included in this PR?
* Switch from grapheme segmentation to codepoint segmentation for `lpad`,
`rpad`, and `translate`
* Add SLT tests
* Refactor a few helper functions
* Remove dependency on `unicode_segmentation` crate as it is no longer used
## Are these changes tested?
Yes.
## Are there any user-facing changes?
Yes. This PR changes the behavior of `lpad`, `rpad`, and `translate`,
although the new behavior is more consistent with
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]