andygrove opened a new pull request, #4537:
URL: https://github.com/apache/datafusion-comet/pull/4537
## Which issue does this PR close?
Closes #4497.
## Rationale for this change
`replace(str, search, replacement)` diverges from Spark when the search
string is empty. Spark short-circuits and returns the source unchanged
(`UTF8String.replace` checks `search.numBytes == 0`). DataFusion's `replace`
instead inserts the replacement between every character and at both boundaries,
producing e.g. `replace('hello', '', 'x')` -> `xhxexlxlxox`. The bug applies to
both literal empty searches and column-valued empty strings at runtime.
## What changes are included in this PR?
- New `CometStringReplace` serde in
`spark/src/main/scala/org/apache/comet/serde/strings.scala`. By default it
routes through `CometScalaUDF.emitJvmCodegenDispatch` so Spark's own
`doGenCode` runs inside the Comet pipeline. Users can opt back into the native
(divergent) path with
`spark.comet.expression.StringReplace.allowIncompatible=true`. The pattern
mirrors `CometInitCap`.
- `QueryPlanSerde` now wires `StringReplace` to `CometStringReplace` instead
of the inline `CometScalarFunction("replace")`.
- `string_replace.sql` removes the `query ignore(#3344)` directive on the
empty-literal case, adds rows for empty-source / NULL-source /
NULL-replacement, and adds a row `('hello', '', 'x')` to the test table to
exercise the column-valued empty-search case.
## Are these changes tested?
Yes. `CometSqlFileTestSuite` runs the updated `string_replace.sql` and
passes on Spark 3.5. Compile verified for Spark 3.4 (Java 11), Spark 3.5 (Java
11), and Spark 4.0 (Java 17).
## Are there any user-facing changes?
The default behavior of `replace` becomes Spark-compatible for all
empty-search cases. Plans that previously executed `replace` natively now route
through the JVM codegen dispatcher when
`spark.comet.exec.scalaUDF.codegen.enabled=true` (the default), which is
slightly slower than the native path but matches Spark exactly. The native path
remains available behind
`spark.comet.expression.StringReplace.allowIncompatible=true`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]