panbingkun commented on code in PR #41169: URL: https://github.com/apache/spark/pull/41169#discussion_r1196188986
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
  * A function that return the Levenshtein distance between the two given strings.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.
+    If threshold is set and distance more than it, return -1.""",
   examples = """
     Examples:
       > SELECT _FUNC_('kitten', 'sitting');
        3
+      > SELECT _FUNC_('kitten', 'sitting', 2);
+       -1
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-  with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression

Review Comment:
1. If we extend `TernaryExpression`, the threshold expression can no longer be optional, as in the implementation of `Substring`. When no threshold is passed in, we would have to give it a default value of `Integer.MAX_VALUE`, and `UTF8String` would keep only one Levenshtein method: `levenshteinDistance(UTF8String other, int threshold)`. However, `org.apache.commons.text.similarity.LevenshteinDistance` does not do this; it keeps two separate methods, `limitedCompare` and `unlimitedCompare`:
https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java
Are we really going to do this?
2. Additionally, for the optional argument I am following the implementation of [`ArrayJoin`](https://github.com/apache/spark/blob/d44e073f0cdaf16028a4854e79db200a4e39a6fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1824-L1854) in Spark.
3. The UT is based on https://github.com/apache/commons-text/blob/master/src/test/java/org/apache/commons/text/similarity/LevenshteinDistanceTest.java#L76-L138
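
To make the single-method option in point 1 concrete, here is a minimal sketch of one entry point whose threshold defaults to `Int.MaxValue`, so that the "unlimited" case is just the default. The names `LevenshteinSketch` and `levenshtein` are hypothetical; this is not the `UTF8String` method from the PR and not the commons-text implementation. It works on `String` characters (UTF-16 code units) for brevity, and the `-1` result follows the `usage` text in the diff above.

```scala
object LevenshteinSketch {
  /**
   * Hypothetical helper for illustration only.
   * Returns the Levenshtein distance between s and t, or -1 if the
   * distance is greater than `threshold`. With the default threshold
   * (Int.MaxValue) the result is never -1.
   */
  def levenshtein(s: String, t: String, threshold: Int = Int.MaxValue): Int = {
    require(threshold >= 0, "threshold must be non-negative")
    // The length difference is a lower bound on the distance.
    if (math.abs(s.length - t.length) > threshold) return -1

    // Two-row dynamic programming: prev holds row i - 1, curr holds row i.
    var prev = Array.tabulate(t.length + 1)(identity)
    var curr = new Array[Int](t.length + 1)

    var i = 1
    while (i <= s.length) {
      curr(0) = i
      var rowMin = curr(0)
      var j = 1
      while (j <= t.length) {
        val cost = if (s.charAt(i - 1) == t.charAt(j - 1)) 0 else 1
        val insertOrDelete = math.min(curr(j - 1), prev(j)) + 1
        curr(j) = math.min(insertOrDelete, prev(j - 1) + cost)
        rowMin = math.min(rowMin, curr(j))
        j += 1
      }
      // The final distance can never be smaller than the minimum of any row,
      // so it is safe to stop once a whole row exceeds the threshold.
      if (rowMin > threshold) return -1
      val tmp = prev
      prev = curr
      curr = tmp
      i += 1
    }
    val distance = prev(t.length)
    if (distance <= threshold) distance else -1
  }
}
```

With this shape, `LevenshteinSketch.levenshtein("kitten", "sitting")` gives 3 and `LevenshteinSketch.levenshtein("kitten", "sitting", 2)` gives -1, matching the examples in the `ExpressionDescription`. The trade-off raised in point 1 remains: a single method is simpler to maintain, whereas keeping separate limited and unlimited variants lets each path be optimized on its own.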