Re: [PR] feat: Support Spark levenshtein expression in native execution [datafusion-comet]

via GitHub Wed, 01 Jul 2026 06:25:33 -0700


andygrove commented on PR #4105:
URL: 
https://github.com/apache/datafusion-comet/pull/4105#issuecomment-4855415428


   Thanks for the persistence on this one @Myx778, and thanks especially for 
calling out the codegen-dispatch vs native decision explicitly. A few thoughts 
after going through the diff and checking the Rust against Spark's 
`UTF8String.levenshteinDistance`.
   
   **Correctness looks good.** I traced the threshold logic against Spark's 
banded `limitedCompare` algorithm. Spark returns the true distance `D` when `D 
<= threshold` and `-1` otherwise, and the PR's `dist > t ? -1 : dist` is 
exactly equivalent for all threshold values including negatives and the 
empty-string cases. NULL propagation matches Spark too. Nice work on the test 
coverage here.
   
   **The main thing to weigh is whether the native path is worth it.** `main` 
already routes `Levenshtein` through `CometCodegenDispatch`, which runs Spark's 
own `doGenCode` inside the kernel and is Spark-exact with no correctness risk. 
The benchmark numbers show this native version is roughly on par with Spark 
(~1.0X, and 0.9X for scan-only) rather than faster. Could you spell out the 
concrete motivation for the native path? For example, that it keeps the 
pipeline native when `COMET_SCALA_UDF_CODEGEN_ENABLED` is off. That helps 
justify replacing a zero-risk path with a hand-written Rust implementation to 
maintain.
   
   **A couple of smaller code points:**
   
   - In `spark_levenshtein`, could we use 
`as_generic_string_array::<i32>(...)?` instead of `as_string_array(...)`? The 
latter panics on a type mismatch and only handles `Utf8`, whereas the rest of 
`string_funcs/` (see `split.rs`, `regexp_extract_all.rs`) uses the fallible 
form and also handles `LargeUtf8`.
   - `getSupportLevel` checks collation on `children.headOption` only. Since 
the native code does raw char comparison and ignores collation, checking all 
string children (`children.exists(...)`) would be safer for the case where only 
the right argument is collated.
   - The new `CometStringExpressionSuite` cases overlap heavily with the SQL 
file tests. Since we prefer `CometSqlFileTestSuite` for expression coverage, 
would you be open to trimming the Scala duplicates?
   
   Also the PR is showing merge conflicts again, so it will need another rebase 
onto `main` before CI is meaningful.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: Support Spark levenshtein expression in native execution [datafusion-comet]

Reply via email to