[
https://issues.apache.org/jira/browse/SPARK-47295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-47295:
-----------------------------------
Labels: pull-request-available (was: )
> startswith, endswith (non-binary collations)
> --------------------------------------------
>
> Key: SPARK-47295
> URL: https://issues.apache.org/jira/browse/SPARK-47295
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> Implement the *startsWith* and *endsWith* built-in Spark string functions using
> {_}StringSearch{_}, an efficient ICU service for string matching. Refer to
> the latest unit tests in CollationSuite to see how these functions are
> used in Spark SQL, and feel free to experiment with the existing functions in
> a Spark SQL editor of your choice to learn more about how they work.
>
> Currently, these 2 functions support all collation types:
> # binary collations (UCS_BASIC, UNICODE) - special case: these collation
> types work using the existing string comparison functions, i.e. contains(),
> startsWith(), endsWith()
> # the special lowercase non-binary collation (UCS_BASIC_LCASE) - special case:
> this collation type works by using lower() to convert both strings to
> lowercase and then applying the functions above
> # other non-binary collations (UNICODE_CI; special collations for various
> languages with case and accent sensitivity) - these collation types usually
> require special handling, which can sometimes be complex
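
The first two cases in the list above can be sketched roughly as follows. This is a minimal illustration in plain Java, not Spark's actual code: the class name, the Kind enum and the method signatures are hypothetical, String.toLowerCase stands in for lower(), and the third case is deliberately left unimplemented because it needs collation-aware matching.

```java
import java.util.Locale;

class CollationDispatchSketch {
    // Hypothetical grouping of the three collation kinds described above.
    enum Kind { BINARY, LOWERCASE, OTHER_NON_BINARY }

    static boolean startsWith(String target, String prefix, Kind kind) {
        switch (kind) {
            case BINARY:
                // Case 1: plain code-unit comparison is sufficient.
                return target.startsWith(prefix);
            case LOWERCASE:
                // Case 2: lowercase both sides, then reuse the binary check.
                return target.toLowerCase(Locale.ROOT)
                        .startsWith(prefix.toLowerCase(Locale.ROOT));
            default:
                // Case 3: requires collation-aware matching, out of scope here.
                throw new UnsupportedOperationException("needs collation-aware matching");
        }
    }

    static boolean endsWith(String target, String suffix, Kind kind) {
        switch (kind) {
            case BINARY:
                return target.endsWith(suffix);
            case LOWERCASE:
                return target.toLowerCase(Locale.ROOT)
                        .endsWith(suffix.toLowerCase(Locale.ROOT));
            default:
                throw new UnsupportedOperationException("needs collation-aware matching");
        }
    }

    public static void main(String[] args) {
        System.out.println(startsWith("Spark SQL", "spark", Kind.BINARY));    // false
        System.out.println(startsWith("Spark SQL", "spark", Kind.LOWERCASE)); // true
        System.out.println(endsWith("Spark SQL", "sql", Kind.LOWERCASE));     // true
    }
}
```
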
>
> To understand what changes were introduced in order to enable collation
> support for these functions, take a look at the Spark PRs and Jira tickets
> below:
> * [https://github.com/apache/spark/pull/45216] this PR enables:
> ** partial collation support for *contains* (skipping the 3rd type of
> collations shown above)
> ** complete collation support for {*}startsWith{*}, *endsWith* (using a
> special _matchAt_ implementation directly in {_}UTF8String{_})
> * [https://github.com/apache/spark/pull/45382] this PR enables:
> ** complete collation support for *contains* (using {_}StringSearch{_}) _->
> now we should also use this approach for startsWith & endsWith_
>
> Focusing on the 3rd type of collations as shown above, the goal for this Jira
> ticket is to re-implement the *startsWith* and *endsWith* functions so that
> they use _StringSearch_ instead (following the general logic in the second
> PR). As for the current test cases in CollationSuite, they should already
> mostly cover the expected behaviour of *startsWith* and *endsWith* for the
> 3rd type of collations.
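
One way to picture the search-based approach: find where the pattern matches under the collation, then check that the first match begins at index 0 (startsWith) or that the last match ends at the end of the target (endsWith). The sketch below is only an illustration under loud assumptions: it uses the JDK's java.text.Collator as a stand-in, because ICU's StringSearch lives in ICU4J and is not part of the JDK; all class and method names are hypothetical; and the brute-force scan over UTF-16 substrings is far less efficient than a real string search. With ICU4J, the equivalent positions would presumably come from StringSearch's first(), last() and getMatchLength().

```java
import java.text.Collator;
import java.util.Locale;

class SearchBasedMatchSketch {
    // Stand-in for a case-insensitive collation: SECONDARY strength ignores
    // case differences in the JDK collator (assumption: English rules).
    static Collator caseInsensitive() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.SECONDARY);
        return collator;
    }

    // Returns {start, end} of the match with the smallest start, or null.
    static int[] firstMatch(String target, String pattern, Collator collator) {
        for (int start = 0; start <= target.length(); start++) {
            for (int end = start; end <= target.length(); end++) {
                if (collator.compare(target.substring(start, end), pattern) == 0) {
                    return new int[] {start, end};
                }
            }
        }
        return null;
    }

    // Returns {start, end} of the match with the largest end, or null.
    static int[] lastMatch(String target, String pattern, Collator collator) {
        for (int end = target.length(); end >= 0; end--) {
            for (int start = end; start >= 0; start--) {
                if (collator.compare(target.substring(start, end), pattern) == 0) {
                    return new int[] {start, end};
                }
            }
        }
        return null;
    }

    // startsWith holds when the earliest match begins at index 0.
    static boolean startsWith(String target, String pattern, Collator collator) {
        int[] match = firstMatch(target, pattern, collator);
        return match != null && match[0] == 0;
    }

    // endsWith holds when the latest match ends at the end of the target.
    static boolean endsWith(String target, String pattern, Collator collator) {
        int[] match = lastMatch(target, pattern, collator);
        return match != null && match[1] == target.length();
    }

    public static void main(String[] args) {
        Collator ci = caseInsensitive();
        System.out.println(startsWith("Spark SQL", "SPARK", ci));
        System.out.println(endsWith("Spark SQL", "sql", ci));
        System.out.println(startsWith("Spark SQL", "sql", ci));
    }
}
```

Note that the matched span in the target need not have the same length as the pattern under a non-binary collation, which is why the sketch tracks both the start and the end of each match instead of assuming a fixed length.
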
>
> Read more about _StringSearch_ using the [ICU user
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
> and [ICU
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
> Also, refer to the Unicode Technical Standards for string
> [searching|https://www.unicode.org/reports/tr10/#Searching] and
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)