cloud-fan commented on code in PR #56310:
URL: https://github.com/apache/spark/pull/56310#discussion_r3356927103


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15577,6 +15577,40 @@ def levenshtein(
         )
 
 
+@_try_remote_functions
+def jaro_winkler_similarity(left: "ColumnOrName", right: "ColumnOrName") -> 
Column:
+    """Computes the Jaro-Winkler similarity between the two given strings.
+
+    The result is a double between 0.0 (no similarity) and 1.0 (identical 
strings).
+
+    .. versionadded:: 4.3.0
+
+    Parameters
+    ----------
+    left : :class:`~pyspark.sql.Column` or column name
+        first column value.
+    right : :class:`~pyspark.sql.Column` or column name
+        second column value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        Jaro-Winkler similarity as a double value.
+
+    Examples
+    --------
+    >>> from pyspark.sql import functions as sf
+    >>> df = spark.createDataFrame([('MARTHA', 'MARHTA')], ['l', 'r'])
+    >>> df.select('*', sf.jaro_winkler_similarity('l', 'r')).show()
+    +------+------+-------------------------------+
+    |     l|     r|jaro_winkler_similarity(l, r)|

Review Comment:
   **Blocking:** this `.show()` table looks hand-written and will fail the 
doctest. The third column's separator rows carry 31 dashes, but the column 
header `jaro_winkler_similarity(l, r)` is 29 characters, so the real rendered 
column width is 29. The doctest runs under `ELLIPSIS | NORMALIZE_WHITESPACE` 
(see `_test()`), which tolerates the value-cell space padding but **not** the 
dash-count mismatch (dashes aren't whitespace), so the comparison fails.
   
   Please regenerate the block by actually running the example. The third 
column should be 29 dashes wide, e.g.:
   ```
   +------+------+-----------------------------+
   |     l|     r|jaro_winkler_similarity(l, r)|
   +------+------+-----------------------------+
   |MARTHA|MARHTA|           0.9611111111111111|
   +------+------+-----------------------------+
   ```



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1981,6 +1981,67 @@ public boolean semanticEquals(final UTF8String other, 
int collationId) {
     return 
CollationFactory.fetchCollation(collationId).equalsFunction.apply(this, other);
   }
 
+  /**
+   * Returns the Jaro-Winkler similarity between this string and another, as a 
double in [0, 1].
+   * A score of 1.0 means the strings are identical, 0.0 means no similarity.
+   * The Jaro-Winkler metric gives a bonus for common prefixes (up to 4 
characters).
+   */
+  public double jaroWinklerSimilarity(UTF8String other) {
+    // Decode to char arrays for character-level comparison

Review Comment:
   Nit: the code creates `String` objects (`s1`, `s2`), not char arrays, and 
compares via `charAt()`.
   ```suggestion
       // Decode to Strings for character-level comparison
   ```



##########
python/docs/source/reference/pyspark.sql/functions.rst:
##########
@@ -184,6 +184,7 @@ String Functions
     left
     length
     levenshtein
+    jaro_winkler_similarity

Review Comment:
   Nit: this list is alphabetical (`left`, `length`, `levenshtein`, `locate`, 
`lower`, `lpad`), so `jaro_winkler_similarity` belongs before the `l*` entries 
rather than after `levenshtein`. The same ordering issue exists in 
`python/pyspark/sql/functions/__init__.py`'s `__all__`. (The Scala 
`FunctionRegistry` / `functions.scala` placements next to `levenshtein` are 
fine — those lists aren't alphabetical.)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2722,6 +2722,52 @@ case class Levenshtein(
   }
 }
 
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(str1, str2) - Returns the Jaro-Winkler similarity between 
the two given strings. The result is a double between 0 and 1, where 1 means 
identical.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_('MARTHA', 'MARHTA');
+       0.9611111111111111
+      > SELECT _FUNC_('kitten', 'sitting');
+       0.746031746031746
+      > SELECT _FUNC_('ABC', 'XYZ');
+       0.0
+  """,
+  since = "4.3.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class JaroWinkler(left: Expression, right: Expression)
+  extends BinaryExpression
+  with ImplicitCastInputTypes {
+
+  override def nullIntolerant: Boolean = true
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(
+    StringTypeWithCollation(supportsTrimCollation = true),
+    StringTypeWithCollation(supportsTrimCollation = true))
+
+  override def dataType: DataType = DoubleType
+
+  override def prettyName: String = "jaro_winkler_similarity"
+
+  override protected def withNewChildrenInternal(
+      newLeft: Expression, newRight: Expression): JaroWinkler =
+    copy(left = newLeft, right = newRight)
+
+  // Note: This implementation uses String.charAt() which operates on UTF-16 
code units.
+  // Strings containing supplementary characters (surrogate pairs) may produce
+  // inaccurate results, consistent with the existing levenshtein() 
implementation.

Review Comment:
   This claim is inaccurate: `levenshteinDistance` does **not** use 
`String.charAt()`. It iterates UTF-8 character boundaries directly on the bytes 
(`numChars()` / `getByte()` / `numBytesForFirstByte()`), so it treats a 
supplementary character as a single unit — whereas `jaroWinklerSimilarity` 
decodes via `toString()` + `charAt()` and splits a supplementary character into 
two UTF-16 units. So the two are not consistent, and the `toString()` path also 
allocates two Java `String`s per row.
   
   The cleanest fix is to implement `jaroWinklerSimilarity` on the `UTF8String` 
representation the way `levenshteinDistance` does — that removes the per-row 
allocation, makes supplementary-character handling genuinely consistent with 
`levenshtein`, and makes this note true. If you prefer to keep the `charAt` 
approach, please at least reword the comment so it doesn't assert consistency 
with `levenshtein`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to