[
https://issues.apache.org/jira/browse/SPARK-55430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057450#comment-18057450
]
Natea Eshetu Beshada commented on SPARK-55430:
----------------------------------------------
I would like to assign this issue to myself, but I can't seem to.
> [SQL] Cache ICU StringSearch for collation string predicates with constant
> patterns
> -------------------------------------------------------------------------------------
>
> Key: SPARK-55430
> URL: https://issues.apache.org/jira/browse/SPARK-55430
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Natea Eshetu Beshada
> Priority: Major
>
> This PR adds StringSearch object caching for Contains, StartsWith, and
> EndsWith expressions when used with ICU-based collations (UNICODE,
> UNICODE_CI) and a compile-time constant (foldable) pattern.
>
> Currently, every row evaluation creates a new com.ibm.icu.text.StringSearch
> object, which involves setting up the ICU collator and pattern matcher from
> scratch. When the pattern is a constant (e.g., col LIKE '%abc%' or
> contains(col, 'abc')), this repeated construction is unnecessary.
>
> With this change, a single StringSearch is created once and reused across
> rows by calling setTarget() for each new input string. This applies to both
> the interpreted path (via @transient private lazy val) and the codegen path
> (via ctx.addMutableState).
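
The create-once, retarget-per-row idea in the description can be sketched with only the Java standard library. This is not the Spark change itself and does not use ICU: java.util.regex.Matcher stands in for com.ibm.icu.text.StringSearch, and Matcher.reset(input) stands in for setTarget(). The class name CachedContains and the constructions() counter are hypothetical, added only to make the reuse visible.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stand-in sketch: one matcher built for a constant pattern, re-pointed at each row. */
public class CachedContains {
    private final Pattern pattern;
    // Created lazily on the first row, then reused for every later row.
    private Matcher cached;
    // Hypothetical counter, present only to show how many matchers were built.
    private int constructions = 0;

    public CachedContains(String constantPattern) {
        // LITERAL treats the pattern as plain text; CASE_INSENSITIVE is only a
        // rough stand-in for a case-insensitive collation, not ICU semantics.
        this.pattern = Pattern.compile(
            constantPattern, Pattern.LITERAL | Pattern.CASE_INSENSITIVE);
    }

    public boolean contains(String row) {
        if (cached == null) {
            cached = pattern.matcher(row);   // expensive setup happens once
            constructions++;
        } else {
            cached.reset(row);               // analogous to setTarget() on a new row
        }
        return cached.find();
    }

    public int constructions() {
        return constructions;
    }

    public static void main(String[] args) {
        CachedContains predicate = new CachedContains("abc");
        for (String row : List.of("xxABCxx", "nothing here", "abc")) {
            System.out.println(row + " -> " + predicate.contains(row));
        }
        System.out.println("matchers built: " + predicate.constructions());
    }
}
```

The cached matcher is mutable state, so an instance like this should not be shared across threads without synchronization.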