[GitHub] [spark] chenhao-db commented on pull request #41169: [SPARK-43493][SQL] Add a max distance argument to the levenshtein() function

via GitHub Mon, 15 May 2023 10:07:28 -0700


chenhao-db commented on PR #41169:
URL: https://github.com/apache/spark/pull/41169#issuecomment-1548233629


   I am wondering whether it is better to follow PostgreSQL's semantics:
   
   > If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater than max_d.
   
   or to follow `org.apache.commons.text.similarity.LevenshteinDistance.limitedCompare`'s semantics and return -1 when the distance is greater than the threshold (the current code).
   
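   I think the former is probably better: the optimizer can safely convert `levenshtein(s1, s2) < c` into `levenshtein(s1, s2, c) < c`, which I believe is a quite common use case of `levenshtein`. With the -1 convention that rewrite would not be safe, because -1 compares as less than any non-negative `c`.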
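
A minimal sketch of the difference between the two conventions, written in Python for brevity. The helper names `bounded_pg_style` and `bounded_minus_one` are hypothetical and are not taken from the PR, PostgreSQL, or commons-text. Both wrappers only model the return values; they compute the full distance and do not exit early the way a real bounded implementation would.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Plain edit distance via the classic two-row dynamic program."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def bounded_pg_style(s1: str, s2: str, max_d: int) -> int:
    """Hypothetical model of the PostgreSQL-style contract:
    exact distance when it is <= max_d, otherwise some value > max_d."""
    d = levenshtein(s1, s2)
    return d if d <= max_d else max_d + 1


def bounded_minus_one(s1: str, s2: str, threshold: int) -> int:
    """Hypothetical model of the -1 contract:
    exact distance when it is <= threshold, otherwise -1."""
    d = levenshtein(s1, s2)
    return d if d <= threshold else -1


s1, s2, c = "kitten", "sitting", 2   # the true distance is 3

original = levenshtein(s1, s2) < c               # False
rewritten_pg = bounded_pg_style(s1, s2, c) < c   # False, agrees with original
rewritten_m1 = bounded_minus_one(s1, s2, c) < c  # True, disagrees with original
```

Under the first contract the rewritten predicate matches the original for every input, since any distance above `c` still maps to a value above `c`. Under the -1 contract the optimizer would need an extra check for the sentinel before comparing.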


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

