sarutak opened a new pull request, #56310: URL: https://github.com/apache/spark/pull/56310
### What changes were proposed in this pull request?

This PR adds a new built-in string function `jaro_winkler_similarity(str1, str2)` that computes the Jaro-Winkler similarity between two strings. The result is a double between 0.0 (no similarity) and 1.0 (identical strings).

The Jaro-Winkler metric is an extension of the Jaro similarity that gives a bonus for common prefixes (up to 4 characters), making it especially suited for short strings such as names.

Changes:
- `UTF8String.jaroWinklerSimilarity()` (core algorithm implementation)
- `JaroWinkler` expression with codegen support
- Registration as `jaro_winkler_similarity` in `FunctionRegistry`
- Unit tests covering basic cases, symmetry, multi-byte strings, and null handling

### Why are the changes needed?

Jaro-Winkler is one of the most commonly used string similarity metrics for record linkage, deduplication, and fuzzy matching. It is available as a built-in function in DuckDB, SQL Server 2025, and various other engines, but Spark users currently need to implement it as a UDF.

A built-in function provides:
- Codegen support (no UDF overhead in whole-stage codegen pipelines)
- Consistency with the existing `levenshtein` function
- Better discoverability for SQL users

### Does this PR introduce _any_ user-facing change?

Yes. A new SQL function `jaro_winkler_similarity` is added.

### How was this patch tested?

Added new tests.

### Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Claude
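For readers unfamiliar with the metric, the following is a minimal Python sketch of the textbook Jaro-Winkler computation described above. It is not the code in this PR: the function names, the prefix scaling factor of 0.1 (the conventional Winkler value), and the treatment of empty strings are assumptions made for illustration, and null handling is not modeled. The sketch compares Python code points, which may differ from how `UTF8String.jaroWinklerSimilarity()` handles multi-byte input.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity over Python code points (illustrative only)."""
    if s1 == s2:
        return 1.0  # assumption: two empty strings count as identical
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    m = matches
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0


def jaro_winkler_similarity(s1: str, s2: str,
                            prefix_scale: float = 0.1,  # assumed conventional value
                            max_prefix: int = 4) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to max_prefix chars."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)
```

With these definitions, the classic pair `"MARTHA"` and `"MARHTA"` has six matching characters and one transposition, giving a Jaro score of about 0.944; the shared three-character prefix `MAR` lifts the Jaro-Winkler score to about 0.961.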
