panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1196188986


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
  * A function that return the Levenshtein distance between the two given strings.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.",
+  usage = """
+    _FUNC_(str1, str2) - Returns the Levenshtein distance between the two given strings.
+      If threshold is set and distance more than it, return -1.""",
   examples = """
     Examples:
       > SELECT _FUNC_('kitten', 'sitting');
        3
+      > SELECT _FUNC_('kitten', 'sitting', 2);
+       -1
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends BinaryExpression
-    with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+    left: Expression,
+    right: Expression,
+    threshold: Option[Expression] = None)
+  extends Expression

Review Comment:
   1. If we extend `TernaryExpression`, the threshold expression can no longer be optional, just as in the implementation of `Substring`. When no threshold is passed in, we would have to default it to `Integer.MAX_VALUE`, and `UTF8String` would keep only one Levenshtein method: `levenshteinDistance(UTF8String other, int threshold)`.
   However, `org.apache.commons.text.similarity.LevenshteinDistance` does not do this; it keeps two separate methods, `limitedCompare` and `unlimitedCompare`:
   
https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java
   
   Do we really want to do this?
   
   2. Additionally, I am following the implementation of [`ArrayJoin`](https://github.com/apache/spark/blob/d44e073f0cdaf16028a4854e79db200a4e39a6fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1824-L1854) in Spark.
   
   3. The unit tests are based on 
https://github.com/apache/commons-text/blob/master/src/test/java/org/apache/commons/text/similarity/LevenshteinDistanceTest.java#L76-L138
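
   To make the trade-off in point 1 concrete, here is a minimal sketch of what a single threshold-aware method could look like, with the unlimited case delegating to it via `Integer.MAX_VALUE`. The class `LevenshteinSketch` and its `distance` methods are hypothetical names for illustration only. It works on `String` and UTF-16 code units rather than `UTF8String`, and uses a plain two-row dynamic program, not the banded algorithm from commons-text.

```java
// Hypothetical illustration; not Spark's UTF8String API and not the commons-text code.
final class LevenshteinSketch {

    // Returns the Levenshtein distance, or -1 if it is greater than the threshold.
    static int distance(String s, String t, int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        int n = s.length();
        int m = t.length();
        // The length difference is a lower bound on the distance.
        if (Math.abs(n - m) > threshold) {
            return -1;
        }
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Values never decrease along a path to the final cell, so if the
            // whole row already exceeds the threshold the result must as well.
            if (rowMin > threshold) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m] <= threshold ? prev[m] : -1;
    }

    // Unlimited variant expressed through the thresholded one.
    static int distance(String s, String t) {
        return distance(s, t, Integer.MAX_VALUE);
    }
}
```

   With this shape, `distance("kitten", "sitting")` gives 3 and `distance("kitten", "sitting", 2)` gives -1, matching the examples in the `ExpressionDescription` above. Whether collapsing to one method like this is preferable to keeping two separate code paths, as commons-text does, is exactly the open question.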



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

