nateab opened a new pull request, #54241: URL: https://github.com/apache/spark/pull/54241
## What changes were proposed in this pull request?

Add `StringSearch` object caching for `Contains`, `StartsWith`, and `EndsWith` expressions when used with ICU-based collations (UNICODE, UNICODE_CI) and a compile-time constant (foldable) pattern.

Currently, every row evaluation creates a new `com.ibm.icu.text.StringSearch` object. When the pattern is constant, this repeated construction is unnecessary. With this change, a single `StringSearch` is created once and reused via `setTarget()` for each new input string, in both the interpreted (`@transient private lazy val`) and codegen (`ctx.addMutableState`) paths.

**Changes:**
- `CollationFactory`: add `getStringSearchForPattern()` factory method
- `CollationSupport`: add cached `execICU()` overloads for Contains, StartsWith, EndsWith
- `stringExpressions.scala`: wire caching into expression eval and codegen when pattern is foldable and collation is ICU-based
- `CollationBenchmark`: add fixed-pattern benchmarks

## Why are the changes needed?

ICU `StringSearch` construction is expensive. For queries scanning large tables with constant string predicates under ICU collations, this overhead is incurred on every row. Caching yields a 2.8-3.4X improvement in the fixed-pattern benchmarks.

## Does this PR introduce any user-facing change?

No. Performance optimization only.

## How was this patch tested?

All 192 existing collation tests pass across 7 test suites. New fixed-pattern benchmarks added:

| Operation | Varying pattern | Fixed pattern (cached) | Improvement |
|---|---|---|---|
| Contains (UNICODE vs UTF8_BINARY) | 115.0X slower | 33.8X slower | **3.4X** |
| StartsWith | 124.2X slower | 37.1X slower | **3.3X** |
| EndsWith | 137.5X slower | 50.0X slower | **2.8X** |

## Was this patch authored or co-authored using generative AI tooling?

Yes, Claude Code was used as an AI coding assistant.
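The build-once, retarget-per-row idea described above can be sketched without any ICU dependency. This is not the code in this PR: it swaps in `java.util.regex.Matcher` as a stand-in for `com.ibm.icu.text.StringSearch`, with `Matcher.reset(input)` playing the role that `setTarget()` plays in the description. The class name `CachedPatternMatcher` and its methods are hypothetical, and case-insensitive regex matching is only a rough analogue of UNICODE_CI collation semantics.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Spark or of this PR.
final class CachedPatternMatcher {
    // Built once for the constant pattern. Stand-in for the cached StringSearch.
    private final Matcher matcher;

    CachedPatternMatcher(String constantPattern) {
        // Pattern.quote treats the pattern as a literal string, not a regex.
        Pattern compiled = Pattern.compile(
            Pattern.quote(constantPattern),
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        this.matcher = compiled.matcher("");
    }

    // Analogue of Contains: point the existing matcher at the new row, then search.
    boolean contains(String input) {
        matcher.reset(input);
        return matcher.find();
    }

    // Analogue of StartsWith: the match must begin at the start of the input.
    boolean startsWith(String input) {
        matcher.reset(input);
        return matcher.lookingAt();
    }

    // Per-row loop that reuses the single matcher instead of building one per row.
    int countContaining(List<String> rows) {
        int count = 0;
        for (String row : rows) {
            if (contains(row)) {
                count++;
            }
        }
        return count;
    }
}
```

The object in this sketch holds mutable state, so it is not safe to share across threads; each task would need its own instance.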
