[
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-47418:
-----------------------------------
Labels: pull-request-available (was: )
> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> ---------------------------------------------------------------------
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> Implement the *contains*, *startsWith*, and *endsWith* built-in Spark string
> functions using the optimized lowercase comparison approach introduced by
> [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to
> the latest design and code structure introduced by [~uros-db] in
> https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation
> support is added to Spark SQL expressions. In addition, review the previous
> Jira tickets under the current parent to understand how
> *StringPredicate* expressions are currently used and tested in Spark:
> * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
> * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
> * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]
> These tickets should help you understand what changes were introduced in
> order to enable collation support for these functions. Lastly, feel free to
> use your chosen Spark SQL Editor to play around with the existing functions
> and learn more about how they work.
>
> The goal of this Jira ticket is to improve the UTF8_BINARY_LCASE
> implementation of the *contains*, *startsWith*, and *endsWith*
> functions so that they use the optimized lowercase comparison approach
> (following the general logic in Nikola's PR), and to benchmark the results
> accordingly. As for testing, the existing unit tests and end-to-end tests
> should already fully cover the expected behaviour of *StringPredicate*
> expressions for all collation types. In other words, the objective of this
> ticket is only to enhance the internal implementation, without introducing
> any user-facing changes to the Spark SQL API.
>
> Finally, feel free to refer to the Unicode Technical Standards on string
> [searching|https://www.unicode.org/reports/tr10/#Searching] (UTS #10) and
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback] (UTS #35).
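For readers unfamiliar with the idea, the following is a minimal Python sketch of one way a lowercase-insensitive predicate can avoid building fully lowercased copies of both strings: it compares characters pairwise and lowercases only when they differ. It is not the code from the referenced pull request, and it is not Spark's UTF8String implementation. The function names (region_matches, lcase_starts_with, lcase_ends_with, lcase_contains) are hypothetical, and the sketch assumes that lowercasing maps each code point to exactly one code point, which does not hold for all of Unicode.

```python
def _eq_lower(a: str, b: str) -> bool:
    """Compare two single characters, lowercasing only if they differ."""
    return a == b or a.lower() == b.lower()


def region_matches(text: str, start: int, pattern: str) -> bool:
    """Return True if `pattern` matches `text` at offset `start`,
    ignoring case, without allocating lowercased copies of either string.

    Assumption: lowercasing is a one-to-one code point mapping.
    """
    if start < 0 or start + len(pattern) > len(text):
        return False
    return all(_eq_lower(text[start + i], pattern[i])
               for i in range(len(pattern)))


def lcase_starts_with(text: str, prefix: str) -> bool:
    # Only the first len(prefix) characters are ever inspected.
    return region_matches(text, 0, prefix)


def lcase_ends_with(text: str, suffix: str) -> bool:
    # Only the last len(suffix) characters are ever inspected.
    return region_matches(text, len(text) - len(suffix), suffix)


def lcase_contains(text: str, pattern: str) -> bool:
    # Naive scan over every candidate start offset; an empty pattern
    # matches at offset 0, and a pattern longer than the text never matches.
    return any(region_matches(text, start, pattern)
               for start in range(len(text) - len(pattern) + 1))
```

With this shape, a call such as lcase_starts_with("Apache Spark", "aPAche") touches only the first six characters of the text instead of lowercasing the whole value. Whether this is faster in practice depends on the data, which is why the ticket asks for benchmarks.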