Re: [PR] [SPARK-47415][SQL] Collation support: Levenshtein [spark]

via GitHub Wed, 29 May 2024 02:57:13 -0700


nikolamand-db commented on PR #45963:
URL: https://github.com/apache/spark/pull/45963#issuecomment-2137018880


   After a lengthy discussion we decided to stop the Levenshtein collations 
effort. Levenshtein distance is well defined as _the minimum number of 
single-character edits needed to make the strings compare equal_. Under 
collations this would become _the minimum number of single-character edits 
needed to make the strings equal under the given collation_, which is very 
hard to compute in non-exponential time.
   
   The reason is that the Levenshtein distance algorithm assumes we can 
compare single characters of the two strings for equality, whereas under 
collations we cannot safely assume that comparing single characters is enough. 
For example, the sequences `İ` and `i\u0307` compare equal under the 
`UTF8_BINARY_LCASE` collation but have different numbers of characters 
(1 and 2, respectively).
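
   A minimal sketch of the problem, in Python rather than Spark code. It 
assumes that Python's `str.lower()` is a reasonable stand-in for the lowercase 
comparison done by `UTF8_BINARY_LCASE`; the helper names `levenshtein` and 
`lcase_equal` are hypothetical and not part of the PR. The classic dynamic 
programming algorithm is run with a per-character, case-insensitive equality 
check, and it still reports a non-zero distance for two strings that compare 
equal as a whole.

```python
import operator
from typing import Callable


def levenshtein(a: str, b: str,
                eq: Callable[[str, str], bool] = operator.eq) -> int:
    """Classic two-row Levenshtein distance; `eq` compares ONE character
    of `a` with ONE character of `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if eq(ca, cb) else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcase_equal(x: str, y: str) -> bool:
    # Assumption: str.lower() approximates the UTF8_BINARY_LCASE comparison.
    return x.lower() == y.lower()


dotted_capital_i = "\u0130"  # İ, one code point
i_with_dot_above = "i\u0307"  # i followed by COMBINING DOT ABOVE, two code points

# Whole-string comparison: the two strings are equal after lowercasing.
whole_string_equal = lcase_equal(dotted_capital_i, i_with_dot_above)

# Per-character comparison: neither "i" nor "\u0307" alone matches "İ",
# so the algorithm counts one substitution plus one insertion.
per_char_distance = levenshtein(dotted_capital_i, i_with_dot_above, eq=lcase_equal)

print(whole_string_equal, per_char_distance)
```

   Under the collation-aware definition the distance between these two 
strings would have to be 0, since they already compare equal, yet the 
per-character algorithm cannot see that one character on the left corresponds 
to two on the right.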
   
   Thanks @nebojsa-db and reviewers for your effort. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


