[ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-------------------------------------
    Description: 
*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) 
of different byte lengths can be equal under case-insensitive and lower-case 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.testEndsWith*.

The first passes, since it uses *StringSearch* directly; the second one does 
not.
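
The length mismatch can be reproduced with the JDK alone. The sketch below is not Spark's code: the class and both method names are hypothetical, and locale-independent lowercasing via String.toLowerCase(Locale.ROOT) stands in for a lower-case collation (UNICODE_CI is ICU-based and is not modelled here). It only illustrates why slicing the target by the pattern's byte length can miss a suffix that is equal after case folding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Spark.
class SuffixMatchSketch {

    // Byte-offset approach (assumed simplification): take as many trailing
    // bytes from the target as the pattern has, then compare lowercased text.
    static boolean endsWithByByteOffset(String target, String pattern) {
        byte[] t = target.getBytes(StandardCharsets.UTF_8);
        byte[] p = pattern.getBytes(StandardCharsets.UTF_8);
        if (p.length > t.length) {
            return false;
        }
        byte[] tail = Arrays.copyOfRange(t, t.length - p.length, t.length);
        String tailText = new String(tail, StandardCharsets.UTF_8);
        return tailText.toLowerCase(Locale.ROOT)
                .equals(pattern.toLowerCase(Locale.ROOT));
    }

    // Length-agnostic approach: lowercase both strings first, then compare,
    // so a change in length caused by case mapping does not matter.
    static boolean endsWithLowercased(String target, String pattern) {
        return target.toLowerCase(Locale.ROOT)
                .endsWith(pattern.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String target = "The \u0130o";  // U+0130: capital I with dot above
        String pattern = "i\u0307o";    // 'i' followed by U+0307 combining dot above

        // "\u0130o" is 3 bytes in UTF-8, while "i\u0307o" is 4 bytes.
        System.out.println("suffix bytes in target: "
                + "\u0130o".getBytes(StandardCharsets.UTF_8).length);
        System.out.println("pattern bytes: "
                + pattern.getBytes(StandardCharsets.UTF_8).length);

        System.out.println("byte-offset match: "
                + endsWithByByteOffset(target, pattern));
        System.out.println("lowercased match: "
                + endsWithLowercased(target, pattern));
    }
}
```

With these inputs the byte-offset check slices four bytes from the target, which pulls in the preceding space and therefore reports no match, while the lowercased comparison finds the suffix.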


> endsWith and startsWith don't work correctly for some collations
> ----------------------------------------------------------------
>
>                 Key: SPARK-47863
>                 URL: https://issues.apache.org/jira/browse/SPARK-47863
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Vladimir Golubev
>            Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different byte lengths can be equal under 
> case-insensitive and lower-case collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testContains*.
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
> *CollationSupportSuite.testEndsWith*.
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
