Re: [PR] Improve `LIKE` performance for "contains" style queries [arrow-rs]

via GitHub Sun, 28 Jul 2024 06:07:19 -0700


samuelcolvin commented on PR #6128:
URL: https://github.com/apache/arrow-rs/pull/6128#issuecomment-2254513573


   The problem here I think is that the haystack length is only [4 
characters](https://github.com/apache/arrow-rs/blob/5f5a82cb388094ea2a54f313b5a43249613195aa/arrow/src/util/bench_util.rs#L116)!
 While of course some haystacks will be that short, I personally don't think 
that's the only case we should consider.
   
   Running
   
   ```bash
   cargo bench -p arrow --bench comparison_kernels -F test_utils -- 
--save-baseline `git rev-parse --abbrev-ref HEAD` '^(like|nlike).*scalar 
contains'
   git checkout contains-performance
   cargo bench -p arrow --bench comparison_kernels -F test_utils -- 
--save-baseline `git rev-parse --abbrev-ref HEAD` '^(like|nlike).*scalar 
contains'
   critcmp master contains-performance
   ```
   
   With different haystack lengths gives very different results from those 
shown above:
   
   ### With haystack length=10
   
   ```
   group                            contains-performance                   
master
   -----                            --------------------                   
------
   like_utf8 scalar contains        1.00    375.5±4.54µs        ? ?/sec    2.43 
   912.8±6.32µs        ? ?/sec
   like_utf8view scalar contains    1.00     89.6±1.94ms        ? ?/sec    2.01 
   179.8±1.80ms        ? ?/sec
   nlike_utf8 scalar contains       1.00    374.3±2.67µs        ? ?/sec    2.43 
   911.1±1.95µs        ? ?/sec
   ```
   
   ### With haystack length=20
   
   ```
   group                            contains-performance                   
master
   -----                            --------------------                   
------
   like_utf8 scalar contains        1.00    229.6±1.50µs        ? ?/sec    4.46 
  1024.3±6.95µs        ? ?/sec
   like_utf8view scalar contains    1.00     88.9±0.79ms        ? ?/sec    2.07 
   184.4±2.52ms        ? ?/sec
   nlike_utf8 scalar contains       1.00    230.6±3.77µs        ? ?/sec    4.44 
  1025.1±6.38µs        ? ?/sec
   ```
   
   ### With haystack length=100
   
   ```
   group                            contains-performance                   
master
   -----                            --------------------                   
------
   like_utf8 scalar contains        1.00    377.3±2.50µs        ? ?/sec    4.51 
 1701.4±17.41µs        ? ?/sec
   like_utf8view scalar contains    1.00     89.4±0.39ms        ? ?/sec    2.02 
   180.6±2.29ms        ? ?/sec
   nlike_utf8 scalar contains       1.00    376.6±1.74µs        ? ?/sec    4.58 
 1724.9±15.14µs        ? ?/sec
   ```
   
   ### With random haystack length=0..400
   
   **IMHO this is by far the most representative benchmark** (happy to submit a 
PR to make this change permanent if you agree).
   
   Using the following in `create_string_array_with_len`
   
   ```rs
                   let length = rng.gen_range(0..max_str_len);
                   let value = 
rng.sample_iter(&Alphanumeric).take(length).collect();
   ```
   
   ```
   group                            contains-performance                   
master
   -----                            --------------------                   
------
   like_utf8 scalar contains        1.00   882.0±29.67µs        ? ?/sec    3.33 
     2.9±0.08ms        ? ?/sec
   like_utf8view scalar contains    1.00     89.8±2.13ms        ? ?/sec    2.04 
   182.9±3.11ms        ? ?/sec
   nlike_utf8 scalar contains       1.00   867.1±11.87µs        ? ?/sec    3.39 
     2.9±0.06ms        ? ?/sec
   ```
   
   ---
   
   These benchmarks were all run on my M3 mac, but that's pretty consistent 
with the GCP VM. I ran the last benchmark ("With random haystack 
length=0..400") on GCP to confirm:
   
   ```
   group                            contains-performance                   
master
   -----                            --------------------                   
------
   like_utf8 scalar contains        1.00  1661.6±16.26µs        ? ?/sec    3.22 
     5.4±0.02ms        ? ?/sec
   like_utf8view scalar contains    1.00    132.4±0.21ms        ? ?/sec    2.92 
   386.5±0.46ms        ? ?/sec
   nlike_utf8 scalar contains       1.00  1659.9±14.95µs        ? ?/sec    3.22 
     5.3±0.01ms        ? ?/sec
   ```
   
   ---
   
   ## Summary
   
   I guess we could do some special casing in the contains logic to use 
`str.contains` for very short haystacks, but personally I don't think it's 
worth it — it would make the kernel a bit slower in all cases.
   
   Personally I think this change is good as is and we should update the 
benchmark code to use more representative haystacks.
   
   LMK what you think.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Improve `LIKE` performance for "contains" style queries [arrow-rs]

Reply via email to