[ 
https://issues.apache.org/jira/browse/SPARK-47295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47295:
---------------------------------
    Summary: startswith, endswith (all collations)  (was: startswith, endswith 
(non-binary collations))

> startswith, endswith (all collations)
> -------------------------------------
>
>                 Key: SPARK-47295
>                 URL: https://issues.apache.org/jira/browse/SPARK-47295
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Assignee: Stevo Mitric
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Implement *startsWith* and *endsWith* built-in string Spark functions using 
> {_}StringSearch{_}, an efficient ICU service for string matching. Refer to 
> the latest unit tests in CollationSuite to understand how these functions are 
> used in SparkSQL, and feel free to use your chosen Spark SQL Editor to play 
> around with the existing functions to learn more about how they work.
>  
> Currently, these 2 functions support all collation types:
>  # binary collations (UCS_BASIC, UNICODE) - special case: these collation 
> types use the existing string comparison functions, i.e. contains(), 
> startsWith(), endsWith()
>  # lowercase non-binary collations (UCS_BASIC_LCASE) - special case: these 
> collation types use lower() to convert both strings to lowercase, and then 
> apply the functions above
>  # other non-binary collations (UNICODE_CI; special collations for various 
> languages with case and accent sensitivity) - these collation types usually 
> require special handling, which can sometimes be complex
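
The first two collation types in the list above can be sketched roughly as follows. This is only an illustration on plain java.lang.String, not Spark's code, which operates on UTF8String. The class name, the CollationKind enum, and both method signatures are hypothetical; the third type is deliberately left unimplemented here.

```java
import java.util.Locale;

// Hypothetical sketch: dispatch startsWith/endsWith on the kind of collation.
class CollationDispatchSketch {
    // Assumed categories, mirroring the three collation types listed above.
    enum CollationKind { BINARY, LOWERCASE, OTHER }

    static boolean startsWith(String target, String prefix, CollationKind kind) {
        switch (kind) {
            case BINARY:
                // Type 1: plain code-unit comparison.
                return target.startsWith(prefix);
            case LOWERCASE:
                // Type 2: lowercase both sides, then compare as in type 1.
                return target.toLowerCase(Locale.ROOT)
                        .startsWith(prefix.toLowerCase(Locale.ROOT));
            default:
                // Type 3 needs collation-aware matching, which is not shown here.
                throw new UnsupportedOperationException("collation-aware matching required");
        }
    }

    static boolean endsWith(String target, String suffix, CollationKind kind) {
        switch (kind) {
            case BINARY:
                return target.endsWith(suffix);
            case LOWERCASE:
                return target.toLowerCase(Locale.ROOT)
                        .endsWith(suffix.toLowerCase(Locale.ROOT));
            default:
                throw new UnsupportedOperationException("collation-aware matching required");
        }
    }

    public static void main(String[] args) {
        System.out.println(startsWith("Spark SQL", "spark", CollationKind.BINARY));
        System.out.println(startsWith("Spark SQL", "spark", CollationKind.LOWERCASE));
        System.out.println(endsWith("Spark SQL", "sql", CollationKind.LOWERCASE));
    }
}
```

Lowercasing both operands lets the second type reuse the binary code path unchanged, which is presumably why the description treats it as a special case.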
>  
> To understand what changes were introduced in order to enable collation 
> support for these functions, take a look at the Spark PRs and Jira tickets 
> below:
>  * [https://github.com/apache/spark/pull/45216] this PR enables:
>  ** partial collation support for *contains* (skipping the 3rd type of 
> collations shown above)
>  ** complete collation support for {*}startsWith{*}, *endsWith* (using a 
> special _matchAt_ implementation directly in {_}UTF8String{_})
>  * [https://github.com/apache/spark/pull/45382] this PR enables:
>  ** complete collation support for *contains* (using {_}StringSearch{_}) _-> 
> now we should also use this approach for startsWith & endsWith_
>  
> Focusing on the 3rd type of collations as shown above, the goal for this Jira 
> ticket is to re-implement the *startsWith* and *endsWith* functions so that 
> they use _StringSearch_ instead (following the general logic in the second 
> PR). As for the current test cases in CollationSuite, they should already 
> mostly cover the expected behaviour of *startsWith* and *endsWith* for the 
> 3rd type of collations.
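
To illustrate what collation-aware prefix and suffix matching means for the 3rd type, here is a minimal sketch. It assumes ICU4J is not on the classpath, so it swaps in the JDK's java.text.CollationElementIterator as a stand-in. It is not the ICU StringSearch API and not Spark's implementation. The class and helper names are hypothetical, and comparing only primary weights (which ignores case and accents) is an assumption chosen for the example.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in: compare strings by their primary collation weights.
class CollationPrefixSketch {
    // Collect the primary weights of s, skipping elements with no primary weight.
    static List<Integer> primaryOrders(RuleBasedCollator collator, String s) {
        List<Integer> orders = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        for (int ce = it.next(); ce != CollationElementIterator.NULLORDER; ce = it.next()) {
            int primary = CollationElementIterator.primaryOrder(ce);
            if (primary != 0) {
                orders.add(primary);
            }
        }
        return orders;
    }

    // True if the prefix's weight sequence opens the target's weight sequence.
    static boolean startsWith(RuleBasedCollator collator, String target, String prefix) {
        List<Integer> t = primaryOrders(collator, target);
        List<Integer> p = primaryOrders(collator, prefix);
        return p.size() <= t.size() && t.subList(0, p.size()).equals(p);
    }

    // True if the suffix's weight sequence closes the target's weight sequence.
    static boolean endsWith(RuleBasedCollator collator, String target, String suffix) {
        List<Integer> t = primaryOrders(collator, target);
        List<Integer> s = primaryOrders(collator, suffix);
        return s.size() <= t.size() && t.subList(t.size() - s.size(), t.size()).equals(s);
    }

    public static void main(String[] args) {
        // Assumes the JDK returns a RuleBasedCollator here, as OpenJDK does.
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.ROOT);
        System.out.println(startsWith(collator, "Hello World", "HELLO"));
        System.out.println(endsWith(collator, "Hello World", "world"));
    }
}
```

Matching on collation weights rather than on characters is why the matched span in the target need not have the same length as the pattern, so slicing the target by the pattern's length and comparing is not a safe shortcut.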
>  
> Read more about _StringSearch_ using the [ICU user 
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
>  and [ICU 
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
>  Also, refer to the Unicode Technical Standards for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] (UTS #10) and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback] (UTS #35).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

