mkaravel commented on PR #46511: URL: https://github.com/apache/spark/pull/46511#issuecomment-2123824322
@uros-db I am reading this in the PR description:

> Why are the changes needed?
> Fix functions that give unexpected results due to the conditional case mapping when performing string comparisons under UTF8_BINARY_LCASE.

This is not really the reason we are making this change (conditional case mapping is the reason why we want to change the comparator for UTF8_BINARY_LCASE, but it is not directly related to string search). The real reason is that we need to do string search at the UTF8 char-by-char level. Otherwise, the indices returned or considered during string search can be strange and unusable, because case mappings are one-to-many: one UTF8 character can map to two or more UTF8 characters under the case mapping (for both upper and lower casing).
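
To make the index problem concrete, here is a minimal Python sketch. It is not Spark's implementation: `find_per_char` is a hypothetical helper written only for illustration, and it relies on Python's `str.lower()`, which applies the full Unicode case mapping. It shows that lowercasing the whole string first shifts indices when a character expands, while comparing character by character keeps indices valid in the original string.

```python
# "İ" (U+0130) lowercases to two code points: "i" followed by U+0307.
text = "\u0130x"
lowered = text.lower()

# Searching in the lowercased copy gives an index into the lowercased copy,
# which no longer lines up with positions in the original string.
idx_lowered = lowered.find("x")
idx_original = text.find("x")


def find_per_char(text: str, pattern: str) -> int:
    """Hypothetical helper: case-insensitive search that lowercases one
    character at a time, so the returned index refers to the original text."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        if all(text[start + k].lower() == pattern[k].lower() for k in range(m)):
            return start
    return -1


print(len(text), len(lowered))    # lengths differ after lowercasing
print(idx_lowered, idx_original)  # the two indices disagree
print(find_per_char(text, "X"))   # index valid in the original string
```

With the lowercase-then-search approach, a caller that uses the returned index to slice the original string would cut at the wrong position. The per-character comparison avoids that, at the cost of not matching a pattern against the multi-character expansion of a single character.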
