samuelcolvin commented on PR #6128: URL: https://github.com/apache/arrow-rs/pull/6128#issuecomment-2254513573
The problem here I think is that the haystack length is only [4 characters](https://github.com/apache/arrow-rs/blob/5f5a82cb388094ea2a54f313b5a43249613195aa/arrow/src/util/bench_util.rs#L116)! While of course some haystacks will be that short, I personally don't think that's the only case we should consider. Running ```bash cargo bench -p arrow --bench comparison_kernels -F test_utils -- --save-baseline `git rev-parse --abbrev-ref HEAD` '^(like|nlike).*scalar contains' git checkout contains-performance cargo bench -p arrow --bench comparison_kernels -F test_utils -- --save-baseline `git rev-parse --abbrev-ref HEAD` '^(like|nlike).*scalar contains' critcmp master contains-performance ``` With different haystack lengths gives very different results from those shown above: ### With haystack length=10 ``` group contains-performance master ----- -------------------- ------ like_utf8 scalar contains 1.00 375.5±4.54µs ? ?/sec 2.43 912.8±6.32µs ? ?/sec like_utf8view scalar contains 1.00 89.6±1.94ms ? ?/sec 2.01 179.8±1.80ms ? ?/sec nlike_utf8 scalar contains 1.00 374.3±2.67µs ? ?/sec 2.43 911.1±1.95µs ? ?/sec ``` ### With haystack length=20 ``` group contains-performance master ----- -------------------- ------ like_utf8 scalar contains 1.00 229.6±1.50µs ? ?/sec 4.46 1024.3±6.95µs ? ?/sec like_utf8view scalar contains 1.00 88.9±0.79ms ? ?/sec 2.07 184.4±2.52ms ? ?/sec nlike_utf8 scalar contains 1.00 230.6±3.77µs ? ?/sec 4.44 1025.1±6.38µs ? ?/sec ``` ### With haystack length=100 ``` group contains-performance master ----- -------------------- ------ like_utf8 scalar contains 1.00 377.3±2.50µs ? ?/sec 4.51 1701.4±17.41µs ? ?/sec like_utf8view scalar contains 1.00 89.4±0.39ms ? ?/sec 2.02 180.6±2.29ms ? ?/sec nlike_utf8 scalar contains 1.00 376.6±1.74µs ? ?/sec 4.58 1724.9±15.14µs ? ?/sec ``` ### With random haystack length=0..400 **IMHO this is by far the most representative benchmark** (happy to submit a PR to make this change permanent if you agree). Using the following in `create_string_array_with_len` ```rs let length = rng.gen_range(0..max_str_len); let value = rng.sample_iter(&Alphanumeric).take(length).collect(); ``` ``` group contains-performance master ----- -------------------- ------ like_utf8 scalar contains 1.00 882.0±29.67µs ? ?/sec 3.33 2.9±0.08ms ? ?/sec like_utf8view scalar contains 1.00 89.8±2.13ms ? ?/sec 2.04 182.9±3.11ms ? ?/sec nlike_utf8 scalar contains 1.00 867.1±11.87µs ? ?/sec 3.39 2.9±0.06ms ? ?/sec ``` --- These benchmarks were all run on my M3 mac, but that's pretty consistent with the GCP VM. I ran the last benchmark ("With random haystack length=0..400") on GCP to confirm: ``` group contains-performance master ----- -------------------- ------ like_utf8 scalar contains 1.00 1661.6±16.26µs ? ?/sec 3.22 5.4±0.02ms ? ?/sec like_utf8view scalar contains 1.00 132.4±0.21ms ? ?/sec 2.92 386.5±0.46ms ? ?/sec nlike_utf8 scalar contains 1.00 1659.9±14.95µs ? ?/sec 3.22 5.3±0.01ms ? ?/sec ``` --- ## Summary I guess we could do some special casing in the contains logic to use `str.contains` for very short haystacks, but personally I don't think it's worth it — it would make the kernel a bit slower in all cases. Personally I think this change is good as is and we should update the benchmark code to use more representative haystacks. LMK what you think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
