Vladimir Golubev created SPARK-47863:
----------------------------------------

             Summary: endsWith and startsWith don't work correctly for some 
collations
                 Key: SPARK-47863
                 URL: https://issues.apache.org/jira/browse/SPARK-47863
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Vladimir Golubev


*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different lengths can be equal under case-insensitive and lower-case 
collations.



Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
*CollationSupportSuite.testEndsWith*.

The first passes because it uses *StringSearch* directly; the second one 
does not.
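
To illustrate why a byte-offset comparison can miss such a match, here is a 
minimal sketch using only the Java standard library. It is not Spark's 
implementation: the class and both helper methods are hypothetical, and plain 
lowercasing with Locale.ROOT is used as a rough stand-in for a 
case-insensitive collation. It relies on the fact that "İ" (U+0130) is 2 
bytes in UTF-8, while its lowercase form "i̇" (U+0069 U+0307) is 3 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SuffixLengthMismatch {

    // Hypothetical helper: cuts the tail of `target` at a byte offset derived
    // from the byte length of `suffix`, then compares case-insensitively.
    // This mimics the idea of a byte-offset match, not Spark's actual code.
    static boolean endsWithByByteOffset(String target, String suffix) {
        byte[] t = target.getBytes(StandardCharsets.UTF_8);
        byte[] s = suffix.getBytes(StandardCharsets.UTF_8);
        if (s.length > t.length) {
            return false;
        }
        String tail = new String(t, t.length - s.length, s.length, StandardCharsets.UTF_8);
        return tail.toLowerCase(Locale.ROOT).equals(suffix.toLowerCase(Locale.ROOT));
    }

    // Hypothetical helper: lowercases both strings first, so the comparison
    // no longer assumes that equal parts have equal byte lengths.
    static boolean endsWithAfterLowercasing(String target, String suffix) {
        return target.toLowerCase(Locale.ROOT).endsWith(suffix.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String target = "The \u0130o";  // "The İo"
        String suffix = "i\u0307o";     // "i̇o" (i + combining dot above + o)

        System.out.println("bytes in \"\u0130o\": "
            + "\u0130o".getBytes(StandardCharsets.UTF_8).length);
        System.out.println("bytes in suffix: "
            + suffix.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("byte-offset match: " + endsWithByByteOffset(target, suffix));
        System.out.println("lowercase match:   " + endsWithAfterLowercasing(target, suffix));
    }
}
```

In this sketch the byte-offset variant takes the last 4 bytes of the target, 
which is " İo" with a leading space, so it reports no match, while the 
lowercasing variant does match.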



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
