nikolamand-db commented on PR #45963: URL: https://github.com/apache/spark/pull/45963#issuecomment-2137018880
After a lengthy discussion we decided to stop the Levenshtein collations effort. Levenshtein distance is well defined as _the minimum number of single-character edits needed to make the strings compare equal_. Under collations it would be interpreted as _the minimum number of single-character edits needed to make the strings equal under the given collation_, which is very hard to compute in non-exponential time. The reason is that the Levenshtein distance algorithm assumes we can compare a single character from each string for equality, whereas under collations we cannot safely assume that comparing single characters is enough. For example, the sequences `İ` and `i\u0307` compare equal under the `UTF8_BINARY_LCASE` collation but have different numbers of characters (1 and 2, respectively).

Thanks @nebojsa-db and reviewers for your effort.
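To make the `İ` vs `i\u0307` mismatch concrete, here is a minimal Python sketch. It is not Spark code. It assumes that lowercasing with Python's `str.lower()` is a reasonable stand-in for the comparison done by `UTF8_BINARY_LCASE`, and the helper names `lcase_equal` and `levenshtein` are hypothetical. The classic dynamic-programming algorithm compares one character at a time, so it reports a non-zero distance for two strings that compare equal as a whole.

```python
def lcase_equal(a: str, b: str) -> bool:
    """Stand-in for a lowercase-based collation: equal if lowercase forms match."""
    return a.lower() == b.lower()


def levenshtein(s: str, t: str, char_eq=lcase_equal) -> int:
    """Classic Levenshtein DP, using char_eq to compare one character per string."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if char_eq(cs, ct) else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


a = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE (1 character)
b = "i\u0307"  # i followed by COMBINING DOT ABOVE (2 characters)

print(lcase_equal(a, b))  # True: the whole strings compare equal
print(len(a), len(b))     # 1 2
print(levenshtein(a, b))  # 2: per-character comparison never finds a match
```

A collation-aware definition would want a distance of 0 here. Getting that result requires matching variable-length character sequences against each other, which the one-character-per-cell recurrence above cannot express.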
