sarutak opened a new pull request, #56310:
URL: https://github.com/apache/spark/pull/56310

   ### What changes were proposed in this pull request?
   This PR adds a new built-in string function `jaro_winkler_similarity(str1, 
str2)` that computes the Jaro-Winkler similarity between two strings. The 
result is a double between 0.0 (no similarity) and 1.0 (identical strings).
   
   The Jaro-Winkler metric is an extension of the Jaro similarity that gives a 
bonus for common prefixes (up to 4 characters), making it especially suited for 
short strings such as names.
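
For readers unfamiliar with the metric, here is a minimal reference sketch in Python. It is not the code in this PR, and it has not been checked against `UTF8String.jaroWinklerSimilarity()`. It assumes the conventional prefix scaling factor of 0.1, applies the prefix bonus unconditionally (some libraries apply it only above a Jaro threshold), and treats two empty strings as identical. The names `jaro_similarity`, `prefix_scale`, and `max_prefix` are illustrative only.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Plain Jaro similarity over Unicode code points (illustrative sketch)."""
    if s1 == s2:
        return 1.0  # assumption: two empty strings also count as identical
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters match only if their positions differ by at most this.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order; each out-of-order
    # pair is half a transposition.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler_similarity(s1: str, s2: str,
                            prefix_scale: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to max_prefix chars."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)


print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))   # 0.9611
print(round(jaro_winkler_similarity("DIXON", "DICKSONX"), 4))  # 0.8133
```

With at most 4 prefix characters and a scale of 0.1, the bonus closes at most 40% of the gap between the Jaro score and 1.0, so the result stays within [0.0, 1.0].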
   
   Changes:
   - `UTF8String.jaroWinklerSimilarity()` (core algorithm implementation)
   - `JaroWinkler` expression with codegen support
   - Registration as `jaro_winkler_similarity` in `FunctionRegistry`
   - Unit tests covering basic cases, symmetry, multi-byte strings, and null 
handling
   
   ### Why are the changes needed?
   Jaro-Winkler is one of the most commonly used string similarity metrics for 
record linkage, deduplication, and fuzzy matching. It is available as a 
built-in function in DuckDB, SQL Server 2025, and various other engines, but 
Spark users currently need to implement it as a UDF.
   
   A built-in function provides:
   - Codegen support (no UDF overhead in whole-stage codegen pipelines)
   - Consistency with the existing `levenshtein` function
   - Better discoverability for SQL users
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. A new SQL function `jaro_winkler_similarity` is added.
   
   ### How was this patch tested?
   Added new unit tests covering basic cases, symmetry, multi-byte strings, and 
null handling.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Kiro CLI / Claude


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

