Natea Eshetu Beshada created SPARK-55430:
--------------------------------------------
Summary: [SQL] Cache ICU StringSearch for collation string
predicates with constant patterns
Key: SPARK-55430
URL: https://issues.apache.org/jira/browse/SPARK-55430
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.0
Reporter: Natea Eshetu Beshada
This PR adds StringSearch object caching for Contains, StartsWith, and
EndsWith expressions when used with ICU-based collations (UNICODE, UNICODE_CI)
and a compile-time constant (foldable) pattern.
Currently, every row evaluation creates a new com.ibm.icu.text.StringSearch
object, which involves setting up the ICU collator and pattern matcher from
scratch. When the pattern is a constant (e.g., col LIKE '%abc%' or
contains(col, 'abc')), this repeated construction is unnecessary.
With this change, a single StringSearch is created once and reused across
rows by calling setTarget() for each new input string. This applies to both the
interpreted path (via @transient private lazy val) and the
codegen path (via ctx.addMutableState).
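
The reuse pattern described above can be illustrated with a minimal,
standard-library-only sketch. It is not the Spark change and it does not use
ICU: java.util.regex.Matcher stands in for StringSearch, and Matcher.reset()
plays the role that setTarget() plays in the description. The class name
CachedContains and its method are hypothetical, and the regex flags only
approximate case-insensitive matching; they do not implement UNICODE_CI
collation rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Spark code: models "build once, retarget per row".
final class CachedContains {
    // Built once for the constant pattern, like the cached StringSearch.
    private final Matcher matcher;

    CachedContains(String constantPattern) {
        // Pattern.quote treats the pattern as a literal; the flags give a rough
        // case-insensitive match (regex case folding, not ICU collation rules).
        Pattern p = Pattern.compile(
            Pattern.quote(constantPattern),
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        this.matcher = p.matcher("");
    }

    // Per-row call: reset() swaps in the new input, analogous to setTarget().
    boolean contains(String row) {
        return matcher.reset(row).find();
    }
}
```

A Matcher is stateful and not safe to share across threads, so each evaluator
would need its own instance. That is presumably the same reason the cached
object in the description lives in per-instance state (a lazy val or codegen
mutable state) rather than in a shared global.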