cloud-fan commented on code in PR #56310:
URL: https://github.com/apache/spark/pull/56310#discussion_r3376672718


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -2721,6 +2721,53 @@ case class Levenshtein(
   }
 }
 
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(str1, str2) - Returns the Jaro-Winkler similarity between 
the two given strings. The result is a double between 0 and 1, where 1 means 
identical.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_('MARTHA', 'MARHTA');
+       0.9611111111111111
+      > SELECT _FUNC_('kitten', 'sitting');
+       0.746031746031746
+      > SELECT _FUNC_('ABC', 'XYZ');
+       0.0
+  """,
+  since = "4.3.0",
+  group = "string_funcs")
+// scalastyle:on line.size.limit
+case class JaroWinkler(left: Expression, right: Expression)

Review Comment:
   Consider implementing this as a `RuntimeReplaceable` expression delegating 
to a `StaticInvoke`, rather than hand-rolling `nullSafeEval` + `doGenCode` on 
top of a `UTF8String` method.
   
   This copies `Levenshtein`'s pattern, but `Levenshtein` only hand-rolls 
codegen because of its `threshold` short-circuit (early exit) — 
`jaro_winkler_similarity` has no such optimization, so there's no reason to 
write custom eval/codegen. The established pattern for "delegate the algorithm 
to a JVM helper" is `RuntimeReplaceable` + `StaticInvoke`, used by the 
immediately-adjacent registry neighbor `luhn_check` (`Luhncheck`) and 17 other 
expressions in this file. `BinaryPredicate` (a few hundred lines up) is the 
two-arg template:
   
   ```scala
   case class JaroWinkler(left: Expression, right: Expression)
     extends RuntimeReplaceable with ImplicitCastInputTypes with 
BinaryLike[Expression] {
   
     override lazy val replacement: Expression = StaticInvoke(
       classOf[ExpressionImplUtils],
       DoubleType,
       "jaroWinklerSimilarity",
       Seq(left, right),
       inputTypes)
   
     override def inputTypes: Seq[AbstractDataType] = Seq(
       StringTypeWithCollation(supportsTrimCollation = true),
       StringTypeWithCollation(supportsTrimCollation = true))
   
     override def prettyName: String = "jaro_winkler_similarity"
   
     override protected def withNewChildrenInternal(
         newLeft: Expression, newRight: Expression): Expression =
       copy(left = newLeft, right = newRight)
   }
   ```
   
   Benefits:
   - `RuntimeReplaceable` makes `eval`/`doGenCode` `final` (delegating to 
`replacement`), so you drop both — and with them the interpreted-vs-codegen 
dual path.
   - `StaticInvoke`'s default `propagateNull = true` gives null-intolerance, 
and `dataType` is derived from the replacement, so the `nullIntolerant` and 
`dataType` overrides also go away.
   
   I'd also move the algorithm into `ExpressionImplUtils` (where `isLuhnNumber` 
lives — it already imports `UTF8String`) as a static 
`jaroWinklerSimilarity(UTF8String, UTF8String): double`. The current 
`UTF8String.jaroWinklerSimilarity` has no other caller and immediately 
`toString()`s both operands, so it never uses the byte representation that 
`UTF8String` methods exist for (unlike `levenshteinDistance`, which genuinely 
iterates the bytes) — it doesn't really belong on `UTF8String`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to