davidlghellin opened a new pull request, #20927:
URL: https://github.com/apache/datafusion/pull/20927

   ## Which issue does this PR close?
   
   - Part of #15914
   
   ## Rationale for this change
   
   Spark's `levenshtein(str1, str2, threshold)` supports an optional third 
argument: when the computed distance exceeds the threshold, it returns `-1` 
instead of the actual distance. DataFusion core's `levenshtein` only supports 
the 2-argument form. This PR adds the Spark-compatible version to 
`datafusion-spark`.
   
   ## What changes are included in this PR?
   
   - Add `SparkLevenshtein` UDF in 
`datafusion/spark/src/function/string/levenshtein.rs`:
     - 2-arg form: `levenshtein(str1, str2)` — standard Levenshtein distance
     - 3-arg form: `levenshtein(str1, str2, threshold)` — returns `-1` when 
distance > threshold
     - Supports Utf8, Utf8View, and LargeUtf8 (with automatic type coercion)
     - NULL threshold is treated as `0` (Spark behavior)
     - NULL string arguments return NULL
   - Register the function in `string/mod.rs`
   - Add SLT tests covering basic usage, threshold, NULL handling, 
Unicode, column expressions, per-row threshold, and mixed type coercion
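
The semantics listed above can be sketched in plain Rust. This is a minimal illustration, not the code in `levenshtein.rs`: the function names `levenshtein_distance`, `levenshtein2`, and `levenshtein3` are hypothetical, SQL NULL is modeled with `Option`, and the distance is counted over Unicode scalar values (`char`), which is an assumption about how characters are compared.

```rust
/// Two-row dynamic-programming Levenshtein distance.
/// Assumption: characters are compared as Unicode scalar values (`char`).
fn levenshtein_distance(a: &str, b: &str) -> i32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)      // deletion
                .min(curr[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    // After the final swap, `prev` holds the last completed row.
    prev[b.len()] as i32
}

/// Hypothetical 2-arg form: a NULL (None) string argument yields NULL.
fn levenshtein2(a: Option<&str>, b: Option<&str>) -> Option<i32> {
    Some(levenshtein_distance(a?, b?))
}

/// Hypothetical 3-arg form: returns -1 when distance > threshold.
/// A NULL threshold is treated as 0, as described in this PR.
fn levenshtein3(a: Option<&str>, b: Option<&str>, threshold: Option<i32>) -> Option<i32> {
    let distance = levenshtein_distance(a?, b?);
    let threshold = threshold.unwrap_or(0);
    Some(if distance > threshold { -1 } else { distance })
}
```

Under this sketch, `levenshtein3(Some("kitten"), Some("sitting"), Some(2))` yields `Some(-1)` because the distance is 3, while a threshold of `Some(3)` yields `Some(3)`. With a NULL threshold, only identical strings produce a non-negative result.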
   
   
   ## Are these changes tested?
   
   Yes:
   - 22 Rust unit tests in `levenshtein.rs`
   - 20 SQL logic tests in `spark/string/levenshtein.slt`
   
   ## Are there any user-facing changes?
   
   No breaking changes. This PR adds the `levenshtein` function to the Spark 
function library with optional threshold support.
   

