Natea Eshetu Beshada created SPARK-55430:
--------------------------------------------

             Summary:   [SQL] Cache ICU StringSearch for collation string 
predicates with constant patterns
                 Key: SPARK-55430
                 URL: https://issues.apache.org/jira/browse/SPARK-55430
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.0
            Reporter: Natea Eshetu Beshada


  This PR adds StringSearch object caching for Contains, StartsWith, and 
EndsWith expressions when used with ICU-based collations (UNICODE, UNICODE_CI) 
and a compile-time constant (foldable) pattern.

  Currently, every row evaluation creates a new com.ibm.icu.text.StringSearch 
object, which involves setting up the ICU collator and pattern matcher from 
scratch. When the pattern is a constant (e.g., col LIKE '%abc%' or 
contains(col, 'abc')), this repeated construction is unnecessary.

  With this change, a single StringSearch is created once and reused across 
rows by calling setTarget() for each new input string. This applies to both the 
interpreted path (via @transient private lazy val) and the codegen path (via 
ctx.addMutableState).
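
  The reuse pattern described above can be sketched outside Spark roughly as 
follows. This is not the code from the change. To stay runnable with only the 
JDK, it uses java.util.regex.Matcher as a stand-in for 
com.ibm.icu.text.StringSearch, with Matcher.reset(input) playing the role of 
setTarget(). The class name CachedContains and its eval method are hypothetical, 
and regex case-insensitive matching only loosely approximates a collation such 
as UNICODE_CI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of "build once, retarget per row"; not Spark code. */
final class CachedContains {
    // Built once for the constant pattern, analogous to the cached StringSearch.
    private final Matcher matcher;

    CachedContains(String constantPattern) {
        // LITERAL treats the pattern as plain text; the case flags are a rough
        // stand-in for a case-insensitive collation (an assumption of this sketch).
        Pattern compiled = Pattern.compile(
            constantPattern,
            Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        this.matcher = compiled.matcher("");
    }

    /** Evaluates one row by retargeting the existing matcher instead of rebuilding it. */
    boolean eval(String row) {
        // reset(row) swaps in the new input, analogous to StringSearch.setTarget().
        return matcher.reset(row).find();
    }
}
```

  The expensive setup happens in the constructor, and each row only pays for 
retargeting and searching. Because the matcher is mutable state, an instance 
like this should not be shared across threads.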



