[PR] improvements to `(i)starts_with` and `(i)ends_with` [arrow-rs]

via GitHub Thu, 25 Jul 2024 11:58:59 -0700


samuelcolvin opened a new pull request, #6118:
URL: https://github.com/apache/arrow-rs/pull/6118


   # Which issue does this PR close?
   
   Related to (but not closing) #6107.
   
   # Rationale for this change
    
   Lots of context in #6107. This change makes `LIKE` and `ILIKE` queries that 
are simply "starts with" or "ends with" significantly faster.
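
   As a rough illustration of which patterns count as simply "starts with" or 
"ends with", the sketch below classifies a `LIKE` pattern. `SimplePattern` and 
`classify` are hypothetical names, not arrow-rs APIs, and treating any `_` or 
`\` as "not simple" is an assumption made to keep the example short.

   ```rust
   /// Hypothetical classification of a LIKE pattern (not an arrow-rs type).
   #[derive(Debug, PartialEq)]
   enum SimplePattern<'a> {
       /// `foo%`: the value must start with `foo`.
       StartsWith(&'a str),
       /// `%foo`: the value must end with `foo`.
       EndsWith(&'a str),
       /// Anything else needs the general matching path.
       Other,
   }

   /// Assumption: `%`, `_` and `\` are the only special characters.
   fn classify(pattern: &str) -> SimplePattern<'_> {
       let is_special = |c: char| matches!(c, '%' | '_' | '\\');
       if let Some(prefix) = pattern.strip_suffix('%') {
           if !prefix.is_empty() && !prefix.contains(is_special) {
               return SimplePattern::StartsWith(prefix);
           }
       }
       if let Some(suffix) = pattern.strip_prefix('%') {
           if !suffix.is_empty() && !suffix.contains(is_special) {
               return SimplePattern::EndsWith(suffix);
           }
       }
       SimplePattern::Other
   }
   ```

   Under these assumptions `classify("foo%")` is a prefix match and 
`classify("%foo")` is a suffix match, while `%foo%`, `fo_o%` and escaped 
patterns fall through to `Other`.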
   
   Running
   
   ```bash
   cargo bench -p arrow --bench comparison_kernels -F test_utils -- like
   ```
   
   Gives notably:
   
   | Test                             | Change in time |
   |----------------------------------|----------------|
   | like_utf8 scalar ends with       | -53.921%       |
   | like_utf8 scalar starts with     | -53.883%       |
   | like_utf8view scalar ends with   | -26.079%       |
   | like_utf8view scalar starts with | -24.864%       |
   | nlike_utf8 scalar ends with      | -53.944%       |
   | nlike_utf8 scalar starts with    | -53.610%       |
   | ilike_utf8 scalar starts with    | -21.921%       |
   | nilike_utf8 scalar starts with   | -22.727%       |
   
   <details>
   <summary>Full output</summary>
   
   ```
      Compiling arrow-string v52.1.0 (/Users/samuel/code/arrow-rs/arrow-string)
      Compiling arrow v52.1.0 (/Users/samuel/code/arrow-rs/arrow)
       Finished `bench` profile [optimized] target(s) in 4.13s
        Running benches/comparison_kernels.rs (target/release/deps/comparison_kernels-b61fe744923f27b6)
   like_utf8 scalar equals time:   [144.33 µs 144.47 µs 144.69 µs]
                           change: [-0.2401% -0.0477% +0.1358%] (p = 0.63 > 0.05)
                           No change in performance detected.
   Found 10 outliers among 100 measurements (10.00%)
     1 (1.00%) low mild
     5 (5.00%) high mild
     4 (4.00%) high severe
   
   like_utf8 scalar contains
                           time:   [195.98 µs 196.10 µs 196.23 µs]
                           change: [-0.2527% -0.0384% +0.1985%] (p = 0.75 > 0.05)
                           No change in performance detected.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low severe
     1 (1.00%) low mild
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   like_utf8 scalar ends with
                           time:   [66.456 µs 66.538 µs 66.628 µs]
                           change: [-53.995% -53.921% -53.817%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     4 (4.00%) high mild
     1 (1.00%) high severe
   
   like_utf8 scalar starts with
                           time:   [67.058 µs 67.093 µs 67.125 µs]
                           change: [-54.038% -53.883% -53.755%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   like_utf8 scalar complex
                           time:   [124.83 µs 124.89 µs 124.96 µs]
                           change: [-2.2102% -1.9432% -1.6682%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   
   like_utf8view scalar equals
                           time:   [15.993 ms 16.022 ms 16.052 ms]
                           change: [-0.1739% +0.0490% +0.2773%] (p = 0.68 > 0.05)
                           No change in performance detected.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   Benchmarking like_utf8view scalar contains: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.1s, or reduce sample count to 20.
   like_utf8view scalar contains
                           time:   [179.98 ms 180.36 ms 180.76 ms]
                           change: [-0.9752% -0.6832% -0.3939%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 23 outliers among 100 measurements (23.00%)
     9 (9.00%) low mild
     7 (7.00%) high mild
     7 (7.00%) high severe
   
   like_utf8view scalar ends with
                           time:   [21.398 ms 21.473 ms 21.552 ms]
                           change: [-26.453% -26.079% -25.713%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   like_utf8view scalar starts with
                           time:   [21.710 ms 21.790 ms 21.889 ms]
                           change: [-25.151% -24.864% -24.473%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   Benchmarking like_utf8view scalar complex: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 23.3s, or reduce sample count to 20.
   like_utf8view scalar complex
                           time:   [231.47 ms 231.93 ms 232.49 ms]
                           change: [+0.6295% +0.8893% +1.1705%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 7 outliers among 100 measurements (7.00%)
     6 (6.00%) high mild
     1 (1.00%) high severe
   
   nlike_utf8 scalar equals
                           time:   [145.37 µs 145.53 µs 145.70 µs]
                           change: [-0.0855% +0.1646% +0.3920%] (p = 0.18 > 0.05)
                           No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   nlike_utf8 scalar contains
                           time:   [197.16 µs 197.41 µs 197.71 µs]
                           change: [-0.7353% -0.3336% +0.0168%] (p = 0.09 > 0.05)
                           No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   nlike_utf8 scalar ends with
                           time:   [67.028 µs 67.212 µs 67.443 µs]
                           change: [-54.194% -53.944% -53.720%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   nlike_utf8 scalar starts with
                           time:   [67.260 µs 67.327 µs 67.396 µs]
                           change: [-53.731% -53.610% -53.498%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   nlike_utf8 scalar complex
                           time:   [125.99 µs 126.10 µs 126.23 µs]
                           change: [-1.2656% -1.0550% -0.8713%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   ilike_utf8 scalar equals
                           time:   [103.50 µs 103.64 µs 103.79 µs]
                           change: [+2.2061% +2.5400% +2.8621%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   ilike_utf8 scalar contains
                           time:   [664.73 µs 665.66 µs 666.72 µs]
                           change: [+0.4162% +0.6823% +0.9311%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 8 outliers among 100 measurements (8.00%)
     7 (7.00%) high mild
     1 (1.00%) high severe
   
   ilike_utf8 scalar ends with
                           time:   [104.73 µs 104.87 µs 105.01 µs]
                           change: [-0.0506% +0.2329% +0.5070%] (p = 0.11 > 0.05)
                           No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
   
   ilike_utf8 scalar starts with
                           time:   [93.878 µs 94.077 µs 94.287 µs]
                           change: [-22.114% -21.921% -21.735%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   ilike_utf8 scalar complex
                           time:   [136.51 µs 136.59 µs 136.68 µs]
                           change: [-0.8583% -0.6439% -0.4482%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   nilike_utf8 scalar equals
                           time:   [103.23 µs 103.56 µs 104.04 µs]
                           change: [+2.3182% +2.5182% +2.7169%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   nilike_utf8 scalar contains
                           time:   [664.85 µs 665.41 µs 665.99 µs]
                           change: [+0.9991% +1.1622% +1.3429%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   nilike_utf8 scalar ends with
                           time:   [104.22 µs 104.36 µs 104.50 µs]
                           change: [-0.0951% +0.1088% +0.3029%] (p = 0.31 > 0.05)
                           No change in performance detected.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) low mild
     4 (4.00%) high mild
   
   nilike_utf8 scalar starts with
                           time:   [94.197 µs 94.387 µs 94.575 µs]
                           change: [-23.029% -22.727% -22.441%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   nilike_utf8 scalar complex
                           time:   [136.32 µs 136.47 µs 136.63 µs]
                           change: [-2.8598% -2.4438% -2.0479%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     2 (2.00%) high mild
     3 (3.00%) high severe
   
   like_utf8_scalar_dyn dictionary[10] string[4])
                           time:   [29.504 µs 29.550 µs 29.603 µs]
                           change: [+1.1570% +1.4762% +1.8389%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high severe
   
   ilike_utf8_scalar_dyn dictionary[10] string[4])
                           time:   [29.404 µs 29.436 µs 29.471 µs]
                           change: [+0.1443% +0.5140% +0.8086%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   ```
   
   </details>
   
   # What changes are included in this PR?
   
   * New implementations of `starts_with_ignore_ascii_case` and 
`ends_with_ignore_ascii_case`, which showed significant improvements (~20%) 
over the previous implementations.
   * New functions `crate::predicate::starts_with` and 
`crate::predicate::ends_with`, which show a 2x to 3x improvement over 
`str.starts_with` and `str.ends_with`.
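
   For readers unfamiliar with the idea, here is a minimal sketch of an ASCII 
case-insensitive prefix and suffix check that compares raw bytes. It is not 
the code in this PR: the function bodies are an assumption built only on the 
standard library's `<[u8]>::eq_ignore_ascii_case`, and only the function names 
are borrowed from the list above.

   ```rust
   /// Sketch only: ASCII case-insensitive prefix check on the UTF-8 bytes.
   fn starts_with_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
       let (h, n) = (haystack.as_bytes(), needle.as_bytes());
       // `get(..len)` yields None when the haystack is shorter than the needle.
       h.get(..n.len())
           .map_or(false, |prefix| prefix.eq_ignore_ascii_case(n))
   }

   /// Sketch only: ASCII case-insensitive suffix check on the UTF-8 bytes.
   fn ends_with_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
       let (h, n) = (haystack.as_bytes(), needle.as_bytes());
       // `checked_sub` yields None when the needle is longer than the haystack.
       h.len()
           .checked_sub(n.len())
           .map_or(false, |start| h[start..].eq_ignore_ascii_case(n))
   }
   ```

   Working on bytes sidesteps char-boundary checks. Only ASCII letters are 
folded, so `"Hello World"` matches the prefix `"hELLo"`, but `"Ärger"` does 
not match `"ärger"`.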
   
   # Are there any user-facing changes?
   
   There shouldn't be. I fuzzed all the implementations against the default 
methods 
[here](https://github.com/samuelcolvin/quick-strings/blob/main/fuzz/fuzz_targets/fuzz_all.rs).


