cloud-fan commented on code in PR #56310:
URL: https://github.com/apache/spark/pull/56310#discussion_r3356927103
##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15577,6 +15577,40 @@ def levenshtein(
)
+@_try_remote_functions
+def jaro_winkler_similarity(left: "ColumnOrName", right: "ColumnOrName") ->
Column:
+ """Computes the Jaro-Winkler similarity between the two given strings.
+
+ The result is a double between 0.0 (no similarity) and 1.0 (identical
strings).
+
+ .. versionadded:: 4.3.0
+
+ Parameters
+ ----------
+ left : :class:`~pyspark.sql.Column` or column name
+ first column value.
+ right : :class:`~pyspark.sql.Column` or column name
+ second column value.
+
+ Returns
+ -------
+ :class:`~pyspark.sql.Column`
+ Jaro-Winkler similarity as a double value.
+
+ Examples
+ --------
+ >>> from pyspark.sql import functions as sf
+ >>> df = spark.createDataFrame([('MARTHA', 'MARHTA')], ['l', 'r'])
+ >>> df.select('*', sf.jaro_winkler_similarity('l', 'r')).show()
+ +------+------+-------------------------------+
+ | l| r|jaro_winkler_similarity(l, r)|
Review Comment:
**Blocking:** this `.show()` table looks hand-written and will fail the
doctest. The third column's separator rows carry 31 dashes, but the column
header `jaro_winkler_similarity(l, r)` is 29 characters, so the real rendered
column width is 29. The doctest runs under `ELLIPSIS | NORMALIZE_WHITESPACE`
(see `_test()`), which tolerates the value-cell space padding but **not** the
dash-count mismatch (dashes aren't whitespace), so the comparison fails.
Please regenerate the block by actually running the example. The third
column should be 29 dashes wide, e.g.:
```
+------+------+-----------------------------+
| l| r|jaro_winkler_similarity(l, r)|
+------+------+-----------------------------+
|MARTHA|MARHTA| 0.9611111111111111|
+------+------+-----------------------------+
```
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1981,6 +1981,67 @@ public boolean semanticEquals(final UTF8String other,
int collationId) {
return
CollationFactory.fetchCollation(collationId).equalsFunction.apply(this, other);
}
+ /**
+ * Returns the Jaro-Winkler similarity between this string and another, as a
double in [0, 1].
+ * A score of 1.0 means the strings are identical, 0.0 means no similarity.
+ * The Jaro-Winkler metric gives a bonus for common prefixes (up to 4
characters).
+ */
+ public double jaroWinklerSimilarity(UTF8String other) {
+ // Decode to char arrays for character-level comparison
Review Comment:
Nit: the code creates `String` objects (`s1`, `s2`), not char arrays, and
compares via `charAt()`.
```suggestion
// Decode to Strings for character-level comparison
```
##########
python/docs/source/reference/pyspark.sql/functions.rst:
##########
@@ -184,6 +184,7 @@ String Functions
left
length
levenshtein
+ jaro_winkler_similarity
Review Comment:
Nit: this list is alphabetical (`left`, `length`, `levenshtein`, `locate`,
`lower`, `lpad`), so `jaro_winkler_similarity` belongs before the `l*` entries
rather than after `levenshtein`. The same ordering issue exists in
`python/pyspark/sql/functions/__init__.py`'s `__all__`. (The Scala
`FunctionRegistry` / `functions.scala` placements next to `levenshtein` are
fine — those lists aren't alphabetical.)
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2722,6 +2722,52 @@ case class Levenshtein(
}
}
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+ usage = "_FUNC_(str1, str2) - Returns the Jaro-Winkler similarity between
the two given strings. The result is a double between 0 and 1, where 1 means
identical.",
+ examples = """
+ Examples:
+ > SELECT _FUNC_('MARTHA', 'MARHTA');
+ 0.9611111111111111
+ > SELECT _FUNC_('kitten', 'sitting');
+ 0.746031746031746
+ > SELECT _FUNC_('ABC', 'XYZ');
+ 0.0
+ """,
+ since = "4.3.0",
+ group = "string_funcs")
+// scalastyle:on line.size.limit
+case class JaroWinkler(left: Expression, right: Expression)
+ extends BinaryExpression
+ with ImplicitCastInputTypes {
+
+ override def nullIntolerant: Boolean = true
+
+ override def inputTypes: Seq[AbstractDataType] = Seq(
+ StringTypeWithCollation(supportsTrimCollation = true),
+ StringTypeWithCollation(supportsTrimCollation = true))
+
+ override def dataType: DataType = DoubleType
+
+ override def prettyName: String = "jaro_winkler_similarity"
+
+ override protected def withNewChildrenInternal(
+ newLeft: Expression, newRight: Expression): JaroWinkler =
+ copy(left = newLeft, right = newRight)
+
+ // Note: This implementation uses String.charAt() which operates on UTF-16
code units.
+ // Strings containing supplementary characters (surrogate pairs) may produce
+ // inaccurate results, consistent with the existing levenshtein()
implementation.
Review Comment:
This claim is inaccurate: `levenshteinDistance` does **not** use
`String.charAt()`. It iterates UTF-8 character boundaries directly on the bytes
(`numChars()` / `getByte()` / `numBytesForFirstByte()`), so it treats a
supplementary character as a single unit — whereas `jaroWinklerSimilarity`
decodes via `toString()` + `charAt()` and splits a supplementary character into
two UTF-16 units. So the two are not consistent, and the `toString()` path also
allocates two Java `String`s per row.
The cleanest fix is to implement `jaroWinklerSimilarity` on the `UTF8String`
representation the way `levenshteinDistance` does — that removes the per-row
allocation, makes supplementary-character handling genuinely consistent with
`levenshtein`, and makes this note true. If you prefer to keep the `charAt`
approach, please at least reword the comment so it doesn't assert consistency
with `levenshtein`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]