neilconway opened a new pull request, #21238:
URL: https://github.com/apache/datafusion/pull/21238
## Which issue does this PR close?
- Closes #21204.
## Rationale for this change
In practice, `split_part` is usually invoked with constant values for
`delimiter` and `position`. We can take advantage of that to hoist some per-row
checks out of the hot loop; more importantly, we can switch from using per-row
`str::split` to building one `memchr::memmem::Finder` and using it for each
row. Building a `Finder` is relatively expensive, but it's a clear win when we
can amortize that upfront cost over an entire input batch.
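
For readers who want the shape of the idea, here is a minimal, std-only sketch of hoisting the scalar checks out of the per-row loop. It is not the code in this PR: `split_part_scalar`, `nth_part`, and `nth_part_from_end` are hypothetical names, `str::find`/`str::rfind` stand in at the spot where a prebuilt `memchr::memmem::Finder` would be reused, and the PostgreSQL-style semantics (1-based positions, negative positions counting from the end, empty string when out of range) are assumed rather than taken from the patch. Overlapping-delimiter edge cases are not considered.

```rust
/// Hypothetical batch helper: `delimiter` and `position` are constants for
/// the whole batch, so everything that depends only on them runs once.
fn split_part_scalar<'a>(
    rows: &[&'a str],
    delimiter: &str,
    position: i64,
) -> Result<Vec<&'a str>, String> {
    // Per-batch check (previously per-row): position must be non-zero.
    if position == 0 {
        return Err("field position must not be zero".to_string());
    }
    // Assumption: an empty delimiter means the whole string is the only part.
    if delimiter.is_empty() {
        let only = position == 1 || position == -1;
        return Ok(rows.iter().map(|&s| if only { s } else { "" }).collect());
    }
    // The direction of the scan is also decided once, outside the hot loop.
    let out = if position > 0 {
        let n = usize::try_from(position - 1).unwrap_or(usize::MAX);
        rows.iter().map(|&s| nth_part(s, delimiter, n)).collect()
    } else {
        let n = usize::try_from(position.unsigned_abs() - 1).unwrap_or(usize::MAX);
        rows.iter().map(|&s| nth_part_from_end(s, delimiter, n)).collect()
    };
    Ok(out)
}

/// Returns the part at zero-based index `n`, scanning from the front.
fn nth_part<'a>(s: &'a str, delimiter: &str, n: usize) -> &'a str {
    let mut start = 0;
    for _ in 0..n {
        // Stand-in for a search with a reusable, prebuilt finder.
        match s[start..].find(delimiter) {
            Some(i) => start += i + delimiter.len(),
            None => return "", // fewer than n + 1 parts
        }
    }
    match s[start..].find(delimiter) {
        Some(i) => &s[start..start + i],
        None => &s[start..],
    }
}

/// Returns the part at zero-based index `n`, counting from the end.
fn nth_part_from_end<'a>(s: &'a str, delimiter: &str, n: usize) -> &'a str {
    let mut end = s.len();
    for _ in 0..n {
        match s[..end].rfind(delimiter) {
            Some(i) => end = i,
            None => return "", // fewer than n + 1 parts
        }
    }
    match s[..end].rfind(delimiter) {
        Some(i) => &s[i + delimiter.len()..end],
        None => &s[..end],
    }
}
```

In this sketch the zero-position check, the empty-delimiter check, and the scan direction are all resolved once per batch, and only the substring search itself remains per-row work, which is the part a shared `Finder` would speed up.
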
Benchmarks (M4 Max):
- `scalar_utf8_single_char/pos_first`: 105 µs → 41 µs, -61%
- `scalar_utf8_single_char/pos_middle`: 358 µs → 97 µs, -73%
- `scalar_utf8_single_char/pos_negative`: 110 µs → 46 µs, -58%
- `scalar_utf8_multi_char/pos_middle`: 355 µs → 132 µs, -63%
- `scalar_utf8_long_strings/pos_middle`: 1.97 ms → 1.11 ms, -43%
- `scalar_utf8view_long_parts/pos_middle`: 467 µs → 169 µs, -63%
- `array_utf8_single_char/pos_middle`: 351 µs → 357 µs, no change
- `array_utf8_multi_char/pos_middle`: 366 µs → 357 µs, -2.6%
## What changes are included in this PR?
* Add benchmarks for `split_part` with scalar delimiter and position
* Add a new fast path for `split_part` with scalar delimiter and position
* Add SLT tests for `split_part` with scalar delimiter and position
## Are these changes tested?
Yes.
## Are there any user-facing changes?
No.