andygrove commented on PR #4105: URL: https://github.com/apache/datafusion-comet/pull/4105#issuecomment-4855415428
Thanks for the persistence on this one @Myx778, and thanks especially for calling out the codegen-dispatch vs native decision explicitly. A few thoughts after going through the diff and checking the Rust against Spark's `UTF8String.levenshteinDistance`. **Correctness looks good.** I traced the threshold logic against Spark's banded `limitedCompare` algorithm. Spark returns the true distance `D` when `D <= threshold` and `-1` otherwise, and the PR's `dist > t ? -1 : dist` is exactly equivalent for all threshold values including negatives and the empty-string cases. NULL propagation matches Spark too. Nice work on the test coverage here. **The main thing to weigh is whether the native path is worth it.** `main` already routes `Levenshtein` through `CometCodegenDispatch`, which runs Spark's own `doGenCode` inside the kernel and is Spark-exact with no correctness risk. The benchmark numbers show this native version is roughly on par with Spark (~1.0X, and 0.9X for scan-only) rather than faster. Could you spell out the concrete motivation for the native path? For example, that it keeps the pipeline native when `COMET_SCALA_UDF_CODEGEN_ENABLED` is off. That helps justify replacing a zero-risk path with a hand-written Rust implementation to maintain. **A couple of smaller code points:** - In `spark_levenshtein`, could we use `as_generic_string_array::<i32>(...)?` instead of `as_string_array(...)`? The latter panics on a type mismatch and only handles `Utf8`, whereas the rest of `string_funcs/` (see `split.rs`, `regexp_extract_all.rs`) uses the fallible form and also handles `LargeUtf8`. - `getSupportLevel` checks collation on `children.headOption` only. Since the native code does raw char comparison and ignores collation, checking all string children (`children.exists(...)`) would be safer for the case where only the right argument is collated. - The new `CometStringExpressionSuite` cases overlap heavily with the SQL file tests. Since we prefer `CometSqlFileTestSuite` for expression coverage, would you be open to trimming the Scala duplicates? Also the PR is showing merge conflicts again, so it will need another rebase onto `main` before CI is meaningful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
