Re: [PR] [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

via GitHub Tue, 21 May 2024 21:03:01 -0700


mkaravel commented on PR #46511:
URL: https://github.com/apache/spark/pull/46511#issuecomment-2123824322


   @uros-db 
   
   I am reading this in the PR description:
   > Why are the changes needed?
   Fix functions that give unexpected results due to the conditional case 
mapping when performing string comparisons under UTF8_BINARY_LCASE.
   
   This is not really the reason we are making this change (conditional case 
mapping is the reason why we want to change the comparator for 
UTF8_BINARY_LCASE, but it is not directly related to string search).
   
   The real reason is that we need to do string search at the UTF8 char-by-char 
level, because otherwise the indices returned or considered during string search 
can be strange and unusable. Case mappings are one-to-many: one UTF8 character 
can map to two or more UTF8 characters under the case mapping (both for upper 
and lower casing).
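
To illustrate the index problem, here is a minimal Python sketch (not the Spark implementation; `naive_find` and `lcase_find` are hypothetical names used only for this example). It relies on the fact that `İ` (U+0130) lowercases to two code points, `i` followed by U+0307, so an index computed on the lowercased string no longer points at the right place in the original string.

```python
def naive_find(target: str, pattern: str) -> int:
    """Lowercase both strings, then search. The returned index refers to the
    lowercased target, which may be longer than the original target."""
    return target.lower().find(pattern.lower())


def lcase_find(target: str, pattern: str) -> int:
    """Search position by position in the ORIGINAL target, so the returned
    index is always valid in the original. Illustrative only: quadratic time,
    and indices are code points rather than UTF-8 byte offsets."""
    lowered_pattern = pattern.lower()
    if lowered_pattern == "":
        return 0
    for start in range(len(target)):
        for end in range(start + 1, len(target) + 1):
            lowered = target[start:end].lower()
            if lowered == lowered_pattern:
                return start
            # Each character lowercases to at least one character, so a
            # longer slice can only get longer; stop once we reach the
            # pattern length without a match.
            if len(lowered) >= len(lowered_pattern):
                break
    return -1


target = "\u0130x"  # "İx": two code points in the original string

naive_index = naive_find(target, "X")   # index into "i\u0307x", not into "İx"
exact_index = lcase_find(target, "X")   # index into the original "İx"
```

With this input, `naive_index` is 2, which is out of range for the two-character original, while `exact_index` is 1, the actual position of `x`.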


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


