samuelcolvin opened a new pull request, #6118:
URL: https://github.com/apache/arrow-rs/pull/6118
# Which issue does this PR close?
Related to (but not closing) #6107.
# Rationale for this change
There is a lot of context in #6107. This PR makes `LIKE` and `ILIKE` queries
that reduce to a simple "starts with" or "ends with" check significantly faster.
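For readers unfamiliar with the terminology, here is a minimal sketch of what "simply starts with / ends with" means for a `LIKE` pattern. It is an illustration only, not the detection logic used in arrow-rs: the `LikeShape` enum and the `classify` function are hypothetical names, and the sketch assumes `%` and `_` are the only wildcards and treats any pattern containing a backslash escape as "other".

```rust
/// Hypothetical classification of a LIKE pattern (illustrative only).
#[derive(Debug, PartialEq)]
enum LikeShape<'a> {
    /// `literal%`: the value must start with `literal`.
    StartsWith(&'a str),
    /// `%literal`: the value must end with `literal`.
    EndsWith(&'a str),
    /// Anything else (equality, contains, `_` wildcards, escapes, ...).
    Other,
}

/// True if `s` has no wildcard or escape characters (assumed set: `%`, `_`, `\`).
fn is_literal(s: &str) -> bool {
    !s.contains(|c: char| matches!(c, '%' | '_' | '\\'))
}

/// Detect the two shapes that can be answered by a prefix or suffix check.
fn classify(pattern: &str) -> LikeShape<'_> {
    if let Some(prefix) = pattern.strip_suffix('%') {
        if is_literal(prefix) {
            return LikeShape::StartsWith(prefix);
        }
    }
    if let Some(suffix) = pattern.strip_prefix('%') {
        if is_literal(suffix) {
            return LikeShape::EndsWith(suffix);
        }
    }
    LikeShape::Other
}
```

Under these assumptions, `foo%` classifies as a "starts with" check on `foo`, `%foo` as an "ends with" check, and `%foo%` or `f_o%` fall through to the general path.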
Running
```bash
cargo bench -p arrow --bench comparison_kernels -F test_utils -- like
```
gives, notably (change in measured time, so negative is faster):
| Test                             | Change in time |
|----------------------------------|----------------|
| like_utf8 scalar ends with       | -53.921%       |
| like_utf8 scalar starts with     | -53.883%       |
| like_utf8view scalar ends with   | -26.079%       |
| like_utf8view scalar starts with | -24.864%       |
| nlike_utf8 scalar ends with      | -53.944%       |
| nlike_utf8 scalar starts with    | -53.610%       |
| ilike_utf8 scalar starts with    | -21.921%       |
| nilike_utf8 scalar starts with   | -22.727%       |
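As a reading aid, a relative change in time $c$ (as reported by the benchmark) corresponds to a speedup factor of $1/(1+c)$. Worked for three of the rows, assuming the reported percentages are changes in mean time:

$$
\begin{aligned}
\text{like\_utf8 ends with:} \quad & \frac{1}{1 - 0.53921} \approx 2.17\times \\
\text{like\_utf8view ends with:} \quad & \frac{1}{1 - 0.26079} \approx 1.35\times \\
\text{ilike\_utf8 starts with:} \quad & \frac{1}{1 - 0.21921} \approx 1.28\times
\end{aligned}
$$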
<details>
<summary>Full output</summary>
```
Compiling arrow-string v52.1.0 (/Users/samuel/code/arrow-rs/arrow-string)
Compiling arrow v52.1.0 (/Users/samuel/code/arrow-rs/arrow)
Finished `bench` profile [optimized] target(s) in 4.13s
Running benches/comparison_kernels.rs
(target/release/deps/comparison_kernels-b61fe744923f27b6)
like_utf8 scalar equals time: [144.33 µs 144.47 µs 144.69 µs]
change: [-0.2401% -0.0477% +0.1358%] (p = 0.63 >
0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
5 (5.00%) high mild
4 (4.00%) high severe
like_utf8 scalar contains
time: [195.98 µs 196.10 µs 196.23 µs]
change: [-0.2527% -0.0384% +0.1985%] (p = 0.75 >
0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
like_utf8 scalar ends with
time: [66.456 µs 66.538 µs 66.628 µs]
change: [-53.995% -53.921% -53.817%] (p = 0.00 <
0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
like_utf8 scalar starts with
time: [67.058 µs 67.093 µs 67.125 µs]
change: [-54.038% -53.883% -53.755%] (p = 0.00 <
0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
like_utf8 scalar complex
time: [124.83 µs 124.89 µs 124.96 µs]
change: [-2.2102% -1.9432% -1.6682%] (p = 0.00 <
0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
like_utf8view scalar equals
time: [15.993 ms 16.022 ms 16.052 ms]
change: [-0.1739% +0.0490% +0.2773%] (p = 0.68 >
0.05)
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking like_utf8view scalar contains: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 18.1s, or reduce sample count to 20.
like_utf8view scalar contains
time: [179.98 ms 180.36 ms 180.76 ms]
change: [-0.9752% -0.6832% -0.3939%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 23 outliers among 100 measurements (23.00%)
9 (9.00%) low mild
7 (7.00%) high mild
7 (7.00%) high severe
like_utf8view scalar ends with
time: [21.398 ms 21.473 ms 21.552 ms]
change: [-26.453% -26.079% -25.713%] (p = 0.00 <
0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
like_utf8view scalar starts with
time: [21.710 ms 21.790 ms 21.889 ms]
change: [-25.151% -24.864% -24.473%] (p = 0.00 <
0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
Benchmarking like_utf8view scalar complex: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 23.3s, or reduce sample count to 20.
like_utf8view scalar complex
time: [231.47 ms 231.93 ms 232.49 ms]
change: [+0.6295% +0.8893% +1.1705%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
nlike_utf8 scalar equals
time: [145.37 µs 145.53 µs 145.70 µs]
change: [-0.0855% +0.1646% +0.3920%] (p = 0.18 >
0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
nlike_utf8 scalar contains
time: [197.16 µs 197.41 µs 197.71 µs]
change: [-0.7353% -0.3336% +0.0168%] (p = 0.09 >
0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
nlike_utf8 scalar ends with
time: [67.028 µs 67.212 µs 67.443 µs]
change: [-54.194% -53.944% -53.720%] (p = 0.00 <
0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
nlike_utf8 scalar starts with
time: [67.260 µs 67.327 µs 67.396 µs]
change: [-53.731% -53.610% -53.498%] (p = 0.00 <
0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
nlike_utf8 scalar complex
time: [125.99 µs 126.10 µs 126.23 µs]
change: [-1.2656% -1.0550% -0.8713%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
ilike_utf8 scalar equals
time: [103.50 µs 103.64 µs 103.79 µs]
change: [+2.2061% +2.5400% +2.8621%] (p = 0.00 <
0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
ilike_utf8 scalar contains
time: [664.73 µs 665.66 µs 666.72 µs]
change: [+0.4162% +0.6823% +0.9311%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
ilike_utf8 scalar ends with
time: [104.73 µs 104.87 µs 105.01 µs]
change: [-0.0506% +0.2329% +0.5070%] (p = 0.11 >
0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
ilike_utf8 scalar starts with
time: [93.878 µs 94.077 µs 94.287 µs]
change: [-22.114% -21.921% -21.735%] (p = 0.00 <
0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
ilike_utf8 scalar complex
time: [136.51 µs 136.59 µs 136.68 µs]
change: [-0.8583% -0.6439% -0.4482%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
nilike_utf8 scalar equals
time: [103.23 µs 103.56 µs 104.04 µs]
change: [+2.3182% +2.5182% +2.7169%] (p = 0.00 <
0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
nilike_utf8 scalar contains
time: [664.85 µs 665.41 µs 665.99 µs]
change: [+0.9991% +1.1622% +1.3429%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
nilike_utf8 scalar ends with
time: [104.22 µs 104.36 µs 104.50 µs]
change: [-0.0951% +0.1088% +0.3029%] (p = 0.31 >
0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low mild
4 (4.00%) high mild
nilike_utf8 scalar starts with
time: [94.197 µs 94.387 µs 94.575 µs]
change: [-23.029% -22.727% -22.441%] (p = 0.00 <
0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
nilike_utf8 scalar complex
time: [136.32 µs 136.47 µs 136.63 µs]
change: [-2.8598% -2.4438% -2.0479%] (p = 0.00 <
0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
like_utf8_scalar_dyn dictionary[10] string[4])
time: [29.504 µs 29.550 µs 29.603 µs]
change: [+1.1570% +1.4762% +1.8389%] (p = 0.00 <
0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe
ilike_utf8_scalar_dyn dictionary[10] string[4])
time: [29.404 µs 29.436 µs 29.471 µs]
change: [+0.1443% +0.5140% +0.8086%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
```
</details>
# What changes are included in this PR?
* New implementations of `starts_with_ignore_ascii_case` and
`ends_with_ignore_ascii_case`, which showed significant improvements (~20%)
over the previous implementations.
* New methods `crate::predicate::starts_with` and
`crate::predicate::ends_with`, which show a 2 to 3x improvement over
`str.starts_with` and `str.ends_with`.
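To make the idea concrete, here is a minimal sketch of byte-wise prefix and suffix checks with a pluggable byte comparison. It is not the code from this PR: the helper names `starts_with_by` and `ends_with_by` are hypothetical, and the sketch only assumes that comparing raw bytes (exactly, or ignoring ASCII case) is sufficient, which holds for UTF-8 because ASCII bytes never occur inside a multi-byte sequence.

```rust
/// Hypothetical helper: does `haystack` start with `needle`, comparing
/// bytes with the caller-supplied `byte_eq`?
fn starts_with_by(haystack: &str, needle: &str, byte_eq: impl Fn(u8, u8) -> bool) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    // A needle longer than the haystack can never match.
    n.len() <= h.len() && h.iter().zip(n).all(|(&a, &b)| byte_eq(a, b))
}

/// Hypothetical helper: same as above, but walking both strings from the end.
fn ends_with_by(haystack: &str, needle: &str, byte_eq: impl Fn(u8, u8) -> bool) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    n.len() <= h.len() && h.iter().rev().zip(n.iter().rev()).all(|(&a, &b)| byte_eq(a, b))
}

/// Case-sensitive prefix check: bytes must be identical.
fn starts_with(haystack: &str, needle: &str) -> bool {
    starts_with_by(haystack, needle, |a, b| a == b)
}

/// Case-sensitive suffix check.
fn ends_with(haystack: &str, needle: &str) -> bool {
    ends_with_by(haystack, needle, |a, b| a == b)
}

/// Prefix check that ignores ASCII case only; non-ASCII bytes must match exactly.
fn starts_with_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
    starts_with_by(haystack, needle, |a, b| a.eq_ignore_ascii_case(&b))
}

/// Suffix check that ignores ASCII case only.
fn ends_with_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
    ends_with_by(haystack, needle, |a, b| a.eq_ignore_ascii_case(&b))
}
```

Whether a formulation like this is actually faster than `str::starts_with` depends on the compiler and workload; the benchmark numbers above apply to the PR's implementation, not to this sketch.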
# Are there any user-facing changes?
There shouldn't be. I fuzzed all the implementations against the default methods
[here](https://github.com/samuelcolvin/quick-strings/blob/main/fuzz/fuzz_targets/fuzz_all.rs).