neilconway commented on PR #20588:
URL: https://github.com/apache/datafusion/pull/20588#issuecomment-3993891789
Alright, I implemented a variant where we do row conversion in chunks of 256
rows. Here are the results on the Hetzner box:
```
group                                          base                                  target
-----                                          ----                                  ------
array_has_all/all_found_small_needle/10        4.81     6.8±0.04ms ? ?/sec      1.00  1422.9±33.96µs ? ?/sec
array_has_all/all_found_small_needle/100       1.62    16.6±0.04ms ? ?/sec      1.00    10.2±0.03ms ? ?/sec
array_has_all/all_found_small_needle/500       1.19    59.4±0.09ms ? ?/sec      1.00    49.8±0.12ms ? ?/sec
array_has_all/not_all_found/10                 5.85     6.5±0.03ms ? ?/sec      1.00  1115.8±9.24µs ? ?/sec
array_has_all/not_all_found/100                1.71    15.0±0.05ms ? ?/sec      1.00     8.8±0.03ms ? ?/sec
array_has_all/not_all_found/500                1.22    52.5±0.11ms ? ?/sec      1.00    43.0±0.09ms ? ?/sec
array_has_all_strings/all_found/10             2.71     5.3±0.03ms ? ?/sec      1.00  1948.9±7.79µs ? ?/sec
array_has_all_strings/all_found/100            1.43    15.8±0.04ms ? ?/sec      1.00    11.1±0.04ms ? ?/sec
array_has_all_strings/all_found/500            1.18    61.0±0.14ms ? ?/sec      1.00    51.6±0.62ms ? ?/sec
array_has_all_strings/not_all_found/10         3.05     4.1±0.02ms ? ?/sec      1.00  1338.3±65.23µs ? ?/sec
array_has_all_strings/not_all_found/100        1.48    14.2±0.08ms ? ?/sec      1.00     9.6±0.05ms ? ?/sec
array_has_all_strings/not_all_found/500        1.23    75.4±0.17ms ? ?/sec      1.00    61.2±0.19ms ? ?/sec
array_has_any/no_match/10                      3.46     7.8±0.05ms ? ?/sec      1.00     2.2±0.01ms ? ?/sec
array_has_any/no_match/100                     1.35    25.3±0.11ms ? ?/sec      1.00    18.7±0.03ms ? ?/sec
array_has_any/no_match/500                     1.14   105.4±0.13ms ? ?/sec      1.00    92.8±2.97ms ? ?/sec
array_has_any/scalar_no_match/10               1.11     2.4±0.01ms ? ?/sec      1.00     2.2±0.01ms ? ?/sec
array_has_any/scalar_no_match/100              1.10    22.9±0.06ms ? ?/sec      1.00    20.8±0.06ms ? ?/sec
array_has_any/scalar_no_match/500              1.06   148.5±0.64ms ? ?/sec      1.00   140.2±1.91ms ? ?/sec
array_has_any/scalar_some_match/10             1.07  1133.4±3.89µs ? ?/sec      1.00  1061.6±4.64µs ? ?/sec
array_has_any/scalar_some_match/100            1.04    11.6±0.16ms ? ?/sec      1.00    11.2±0.08ms ? ?/sec
array_has_any/scalar_some_match/500            1.05    90.9±0.71ms ? ?/sec      1.00    87.0±0.88ms ? ?/sec
array_has_any/some_match/10                    5.26     6.6±0.05ms ? ?/sec      1.00  1264.5±3.59µs ? ?/sec
array_has_any/some_match/100                   1.60    15.7±0.08ms ? ?/sec      1.00     9.8±0.03ms ? ?/sec
array_has_any/some_match/500                   1.17    55.9±0.20ms ? ?/sec      1.00    47.8±0.33ms ? ?/sec
array_has_any_scalar/i64_no_match/1            1.06   396.6±2.17µs ? ?/sec      1.00   372.8±3.30µs ? ?/sec
array_has_any_scalar/i64_no_match/10           1.01   449.7±8.66µs ? ?/sec      1.00  446.0±10.76µs ? ?/sec
array_has_any_scalar/i64_no_match/100          1.02  639.2±20.48µs ? ?/sec      1.00  628.6±17.24µs ? ?/sec
array_has_any_scalar/i64_no_match/1000         1.00  545.1±10.73µs ? ?/sec      1.00  544.1±13.21µs ? ?/sec
array_has_any_scalar/string_no_match/1         1.00   250.5±2.16µs ? ?/sec      1.03   257.9±8.09µs ? ?/sec
array_has_any_scalar/string_no_match/10        1.00   418.3±6.45µs ? ?/sec      1.00   419.4±6.58µs ? ?/sec
array_has_any_scalar/string_no_match/100       1.00  544.9±22.43µs ? ?/sec      1.01  550.0±24.24µs ? ?/sec
array_has_any_scalar/string_no_match/1000      1.00   457.7±8.87µs ? ?/sec      1.00   459.1±6.78µs ? ?/sec
array_has_any_strings/no_match/10              2.12     5.2±0.02ms ? ?/sec      1.00     2.4±0.01ms ? ?/sec
array_has_any_strings/no_match/100             1.21    22.5±0.07ms ? ?/sec      1.00    18.6±0.20ms ? ?/sec
array_has_any_strings/no_match/500             1.11   141.5±0.18ms ? ?/sec      1.00   127.2±0.39ms ? ?/sec
array_has_any_strings/scalar_no_match/10       1.00   861.4±1.90µs ? ?/sec      1.06   909.8±1.83µs ? ?/sec
array_has_any_strings/scalar_no_match/100      1.00     7.4±0.06ms ? ?/sec      1.08     8.0±0.14ms ? ?/sec
array_has_any_strings/scalar_no_match/500      1.02    93.9±0.13ms ? ?/sec      1.00    91.7±0.23ms ? ?/sec
array_has_any_strings/scalar_some_match/10     1.05   827.3±3.93µs ? ?/sec      1.00   788.8±3.78µs ? ?/sec
array_has_any_strings/scalar_some_match/100    1.01     5.2±0.17ms ? ?/sec      1.00     5.1±0.14ms ? ?/sec
array_has_any_strings/scalar_some_match/500    1.00    17.7±0.11ms ? ?/sec      1.04    18.5±0.15ms ? ?/sec
array_has_any_strings/some_match/10            2.56     4.5±0.01ms ? ?/sec      1.00  1758.6±7.71µs ? ?/sec
array_has_any_strings/some_match/100           1.36    14.4±0.07ms ? ?/sec      1.00    10.6±0.06ms ? ?/sec
array_has_any_strings/some_match/500           1.10    54.9±1.41ms ? ?/sec      1.00    50.1±0.20ms ? ?/sec
array_has_i64/found/10                         1.00   144.9±4.94µs ? ?/sec      1.02   147.7±4.93µs ? ?/sec
array_has_i64/found/100                        1.00  570.5±31.30µs ? ?/sec      1.06  605.6±35.62µs ? ?/sec
array_has_i64/found/500                        1.00     4.4±0.15ms ? ?/sec      1.02     4.5±0.12ms ? ?/sec
array_has_i64/not_found/10                     1.03    68.8±0.44µs ? ?/sec      1.00    67.0±1.26µs ? ?/sec
array_has_i64/not_found/100                    1.02  471.6±27.43µs ? ?/sec      1.00  462.7±22.65µs ? ?/sec
array_has_i64/not_found/500                    1.00     4.5±0.11ms ? ?/sec      1.00     4.5±0.11ms ? ?/sec
array_has_strings/found/10                     1.10   744.8±5.29µs ? ?/sec      1.00   679.9±5.94µs ? ?/sec
array_has_strings/found/100                    1.00     2.7±0.03ms ? ?/sec      1.00     2.7±0.04ms ? ?/sec
array_has_strings/found/500                    1.00    15.6±0.21ms ? ?/sec      1.05    16.3±0.35ms ? ?/sec
array_has_strings/not_found/10                 1.02   150.5±0.36µs ? ?/sec      1.00   147.0±1.14µs ? ?/sec
array_has_strings/not_found/100                1.11     6.5±0.04ms ? ?/sec      1.00     5.9±0.08ms ? ?/sec
array_has_strings/not_found/500                1.03    16.5±0.04ms ? ?/sec      1.00    16.0±0.07ms ? ?/sec
```
Happily, this seems to address the regressions we saw on large arrays with
the initial approach. Less happily, 256-row chunking performs slightly worse
than full-batch row conversion on my M4 Max machine, and interestingly the
regressions are confined to the i64 benchmarks:
```
array_has_all (general/i64):
┌───────────────────┬────────────────────────────────┐
│ Benchmark │ change (chunked vs full-batch) │
├───────────────────┼────────────────────────────────┤
│ all_found/10 │ +9.6% slower │
├───────────────────┼────────────────────────────────┤
│ not_all_found/10 │ +9.0% slower │
├───────────────────┼────────────────────────────────┤
│ all_found/100 │ +9.2% slower │
├───────────────────┼────────────────────────────────┤
│ not_all_found/100 │ +10.0% slower │
├───────────────────┼────────────────────────────────┤
│ all_found/500 │ +5.9% slower │
├───────────────────┼────────────────────────────────┤
│ not_all_found/500 │ +5.5% slower │
└───────────────────┴────────────────────────────────┘
array_has_any (general/i64):
┌────────────────┬────────────────────────────────┐
│ Benchmark │ change (chunked vs full-batch) │
├────────────────┼────────────────────────────────┤
│ some_match/10 │ +4.4% slower │
├────────────────┼────────────────────────────────┤
│ no_match/10 │ +3.4% slower │
├────────────────┼────────────────────────────────┤
│ some_match/100 │ +4.4% slower │
├────────────────┼────────────────────────────────┤
│ no_match/100 │ +4.0% slower │
├────────────────┼────────────────────────────────┤
│ some_match/500 │ +2.8% slower │
├────────────────┼────────────────────────────────┤
│ no_match/500 │ +2.4% slower │
└────────────────┴────────────────────────────────┘
```
The string benchmarks were much closer, essentially within the noise.
Avoiding the regressions on large arrays seems worth the small performance
hit on M4 machines, but it's probably worth trying a larger chunk size to
see whether that closes the gap.
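For reference, the chunking pattern under discussion can be sketched library-agnostically. This is illustrative only, not the PR's actual code (which converts Arrow arrays into the row format via Arrow's row conversion machinery); `process_in_chunks` and `CHUNK_SIZE` are hypothetical names, and the closure stands in for the per-chunk convert-and-compare work:

```rust
/// Chunk size mirroring the 256-row granularity discussed above
/// (illustrative constant; the real value would be tuned per the
/// benchmark results).
const CHUNK_SIZE: usize = 256;

/// Process `rows` in fixed-size chunks rather than converting the whole
/// batch at once, which bounds the peak memory held by the converted
/// representation at any one time.
fn process_in_chunks<T, R>(rows: &[T], mut per_chunk: impl FnMut(&[T]) -> Vec<R>) -> Vec<R> {
    let mut out = Vec::with_capacity(rows.len());
    // `chunks` yields slices of at most CHUNK_SIZE elements; the final
    // chunk may be shorter.
    for chunk in rows.chunks(CHUNK_SIZE) {
        out.extend(per_chunk(chunk));
    }
    out
}

fn main() {
    let data: Vec<i64> = (0..1000).collect();
    // Toy stand-in for row conversion: double each value, 256 rows at a time.
    let result = process_in_chunks(&data, |chunk| chunk.iter().map(|x| x * 2).collect());
    assert_eq!(result.len(), 1000);
    assert_eq!(result[999], 1998);
    println!("ok");
}
```

The trade-off this makes explicit is the one the benchmarks show: smaller chunks cap transient allocation for large arrays, at the cost of more per-chunk setup overhead, so the fixed-overhead term grows as the chunk size shrinks.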
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]