samuelcolvin opened a new pull request, #6131:
URL: https://github.com/apache/arrow-rs/pull/6131

   # Which issue does this PR close?
   
   This continues the work from #6118 and #6128, and requires both to be merged first. I think it is the final part of #6107.
   
   # Rationale for this change
    
   There is a lot of context in #6107. This makes `ILIKE` queries that are simply "contains" checks significantly faster.
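   To illustrate what "simply contains" means here, the sketch below checks whether a pattern has the shape `%needle%` with no other wildcards. The function name `contains_needle` and the choice to reject any pattern containing `_`, `%` or a backslash escape are assumptions made for illustration; this is not the detection logic in arrow-string.

   ```rust
   /// Hypothetical helper: returns the literal needle if `pattern` has the
   /// shape `%needle%` and the needle contains no wildcard or escape character.
   fn contains_needle(pattern: &str) -> Option<&str> {
       // Both the leading and the trailing `%` must be present.
       let inner = pattern.strip_prefix('%')?.strip_suffix('%')?;
       // Assumption: any remaining `%`, `_` or `\` means the pattern needs
       // the general LIKE machinery, so we bail out.
       if inner.contains(|c: char| matches!(c, '%' | '_' | '\\')) {
           return None;
       }
       Some(inner)
   }
   ```

   Under these assumptions, `contains_needle("%foo%")` yields `Some("foo")`, while `"foo%"` and `"%f_o%"` both yield `None` and would fall through to a slower path.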
   
   Benchmarks:
   
   ```bash
   ➤ cargo bench -p arrow --bench comparison_kernels -F test_utils -- 'ilike.*contains'
      Compiling arrow-string v52.2.0 (/Users/samuel/code/arrow-rs/arrow-string)
      Compiling arrow v52.2.0 (/Users/samuel/code/arrow-rs/arrow)
       Finished `bench` profile [optimized] target(s) in 3.99s
        Running benches/comparison_kernels.rs (target/release/deps/comparison_kernels-95ab196215ed59e6)
   ilike_utf8 scalar contains
                           time:   [125.84 µs 125.89 µs 125.95 µs]
                           change: [-80.870% -80.819% -80.764%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   
   nilike_utf8 scalar contains
                           time:   [125.79 µs 126.06 µs 126.49 µs]
                           change: [-80.842% -80.809% -80.776%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
     2 (2.00%) low mild
     1 (1.00%) high mild
     12 (12.00%) high severe
   ```
   
   Note: in theory the current regex approach could be faster for very large haystacks, but from my experiments that case will be quite rare, and in many cases the regex will be far slower; see https://github.com/samuelcolvin/quick-strings/pull/1:
   
   ```
   prefix-length=10    needle-length=10    — icontains=10.600 ns regex=23.766 ns
   prefix-length=100   needle-length=10    — icontains=54.671 ns regex=122.80 ns
   prefix-length=1_000 needle-length=10    — icontains=494.05 ns regex=820.21 ns
   prefix-length=2_000 needle-length=10    — icontains=1.0769 µs regex=1.1335 µs
   prefix-length=5_000 needle-length=10    — icontains=2.6704 µs regex=2.0020 µs
   prefix-length=10    needle-length=1_000 — icontains=564.89 ns regex=1.5671 µs
   prefix-length=10    needle-length=2_000 — icontains=1.1176 µs regex=11.945 ms
   prefix-length=10    needle-length=5_000 — icontains=2.7539 µs regex=74.793 ms
   ```
   
   I therefore think it's better to stick to a single implementation rather than add a branch for very large haystacks.
   
   # What changes are included in this PR?
   
   Adds a new special case for case-insensitive ASCII "contains".
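
   For readers unfamiliar with the idea, here is a minimal sketch of a case-insensitive ASCII "contains" check. It is not the code in this PR: the name `ascii_icontains` is hypothetical, and the naive sliding-window scan is chosen only for clarity. It relies on the standard library's `eq_ignore_ascii_case` on byte slices, so only ASCII letters are folded and any non-ASCII bytes must match exactly.

   ```rust
   /// Hypothetical helper: true if `haystack` contains `needle`, ignoring
   /// ASCII case only. This is not full Unicode case folding.
   fn ascii_icontains(haystack: &str, needle: &str) -> bool {
       let (h, n) = (haystack.as_bytes(), needle.as_bytes());
       if n.is_empty() {
           // The empty needle matches everywhere; this also avoids
           // calling `windows(0)`, which panics.
           return true;
       }
       if n.len() > h.len() {
           return false;
       }
       // Slide a needle-sized window over the haystack and compare each
       // window byte-wise, ignoring ASCII case.
       h.windows(n.len()).any(|w| w.eq_ignore_ascii_case(n))
   }
   ```

   For example, `ascii_icontains("Hello World", "WORLD")` is true, while `ascii_icontains("Hello World", "planet")` is false.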
   
   # Are there any user-facing changes?
   
   None, as far as I know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

