panbingkun commented on code in PR #41169:
URL: https://github.com/apache/spark/pull/41169#discussion_r1196188986
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2134,30 +2134,145 @@ case class OctetLength(child: Expression)
* A function that return the Levenshtein distance between the two given
strings.
*/
@ExpressionDescription(
- usage = "_FUNC_(str1, str2) - Returns the Levenshtein distance between the
two given strings.",
+ usage = """
+ _FUNC_(str1, str2) - Returns the Levenshtein distance between the two
given strings.
+ If threshold is set and distance more than it, return -1.""",
examples = """
Examples:
> SELECT _FUNC_('kitten', 'sitting');
3
+ > SELECT _FUNC_('kitten', 'sitting', 2);
+ -1
""",
since = "1.5.0",
group = "string_funcs")
-case class Levenshtein(left: Expression, right: Expression) extends
BinaryExpression
- with ImplicitCastInputTypes with NullIntolerant {
+case class Levenshtein(
+ left: Expression,
+ right: Expression,
+ threshold: Option[Expression] = None)
+ extends Expression
Review Comment:
1. If we extend `TernaryExpression`, the threshold expression can no longer be
optional, as is the case for `Substring`. When no threshold is passed in, we
would have to set a default value for it: `Integer.MAX_VALUE`. The
levenshteinDistance algorithm in `UTF8String` would then keep only one method:
`levenshteinDistance(UTF8String other, int threshold)`.
However, `org.apache.commons.text.similarity.LevenshteinDistance` does not do
this; it keeps two methods, `limitedCompare` and `unlimitedCompare`:
https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java
Are we really going to do this?
2. Additionally, the current approach follows the implementation of
[`ArrayJoin`](https://github.com/apache/spark/blob/d44e073f0cdaf16028a4854e79db200a4e39a6fe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1824-L1854)
in Spark.
3. The unit tests are based on
https://github.com/apache/commons-text/blob/master/src/test/java/org/apache/commons/text/similarity/LevenshteinDistanceTest.java#L76-L138
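
To make the trade-off in point 1 concrete, here is a minimal sketch of the "single method" option, where the unbounded case is just the thresholded method called with `Integer.MAX_VALUE`. It is an illustration only: the class `LevenshteinSketch` and its method names are hypothetical, it works on `java.lang.String` rather than `UTF8String`, and it uses a plain two-row dynamic program with an early exit instead of the banded algorithm in commons-text.

```java
// Hypothetical sketch; not the Spark or commons-text implementation.
final class LevenshteinSketch {
    private LevenshteinSketch() {}

    /** Unbounded distance, expressed as the thresholded call with the largest possible threshold. */
    static int unlimited(String left, String right) {
        return limited(left, right, Integer.MAX_VALUE);
    }

    /** Returns the Levenshtein distance, or -1 if it is greater than threshold. */
    static int limited(String left, String right, int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        int n = left.length();
        int m = right.length();
        // The length difference is a lower bound on the distance.
        if (Math.abs(n - m) > threshold) {
            return -1;
        }
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // distance from the empty prefix of left
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Row minima never decrease, so the final distance cannot drop back under the threshold.
            if (rowMin > threshold) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m] <= threshold ? prev[m] : -1;
    }
}
```

With this shape, `limited("kitten", "sitting", 2)` returns -1 and `unlimited("kitten", "sitting")` returns 3, matching the examples in the `ExpressionDescription`. Note that this simple version never adds to `threshold`, so `Integer.MAX_VALUE` is safe here; a banded implementation would need to guard against overflow when computing the band limits.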
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]