chenhao-db commented on PR #41169: URL: https://github.com/apache/spark/pull/41169#issuecomment-1548233629
I am wondering whether it is better to follow PostgreSQL's semantics:

> If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater than max_d.

or to follow the semantics of `org.apache.commons.text.similarity.LevenshteinDistance.limitedCompare`, which returns -1 when the distance is greater than the threshold (the current code).

I think the former is probably better: the optimizer can safely convert `levenshtein(s1, s2) < c` into `levenshtein(s1, s2, c) < c`, which I believe is quite a common use case of `levenshtein`.
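To make the difference concrete, here is a minimal Python sketch of the two return conventions. It is not Spark, PostgreSQL, or Commons Text code: `bounded_pg_style` and `bounded_minus_one` are hypothetical helpers that compute the full distance and then apply the convention, and returning `max_d + 1` is just one assumed choice of "some value greater than max_d".

```python
def levenshtein(a: str, b: str) -> int:
    """Unbounded Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bounded_pg_style(a: str, b: str, max_d: int) -> int:
    """Hypothetical model of the PostgreSQL-like convention:
    exact distance if it is <= max_d, otherwise a value > max_d
    (max_d + 1 is an assumption made for this sketch)."""
    d = levenshtein(a, b)
    return d if d <= max_d else max_d + 1


def bounded_minus_one(a: str, b: str, threshold: int) -> int:
    """Hypothetical model of the -1 convention:
    exact distance if it is <= threshold, otherwise -1."""
    d = levenshtein(a, b)
    return d if d <= threshold else -1


if __name__ == "__main__":
    s1, s2, c = "kitten", "sitting", 2  # true distance is 3
    print(levenshtein(s1, s2) < c)           # original predicate
    print(bounded_pg_style(s1, s2, c) < c)   # rewritten, PostgreSQL-like
    print(bounded_minus_one(s1, s2, c) < c)  # rewritten, -1 convention
```

Under the first convention the rewritten predicate agrees with the original for every `c`, because any distance above `c` is reported as a value above `c`. Under the -1 convention the rewritten predicate becomes true whenever the real distance exceeds `c` (for `c > -1`), so the rewrite would need an extra guard such as checking for a non-negative result.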
