[PR] fix: route StringReplace through codegen dispatcher to fix empty-search divergence [datafusion-comet]

via GitHub Sat, 30 May 2026 09:34:58 -0700


andygrove opened a new pull request, #4537:
URL: https://github.com/apache/datafusion-comet/pull/4537


   ## Which issue does this PR close?
   
   Closes #4497.
   
   ## Rationale for this change
   
   `replace(str, search, replacement)` diverges from Spark when the search 
string is empty. Spark short-circuits and returns the source unchanged 
(`UTF8String.replace` checks `search.numBytes == 0`). DataFusion's `replace` 
instead inserts the replacement between every character and at both boundaries, 
producing e.g. `replace('hello', '', 'x')` -> `xhxexlxlxox`. The bug applies to 
both literal empty searches and column-valued empty strings at runtime.
   
   ## What changes are included in this PR?
   
   - New `CometStringReplace` serde in 
`spark/src/main/scala/org/apache/comet/serde/strings.scala`. By default it 
routes through `CometScalaUDF.emitJvmCodegenDispatch` so Spark's own 
`doGenCode` runs inside the Comet pipeline. Users can opt back into the native 
(divergent) path with 
`spark.comet.expression.StringReplace.allowIncompatible=true`. The pattern 
mirrors `CometInitCap`.
   - `QueryPlanSerde` now wires `StringReplace` to `CometStringReplace` instead 
of the inline `CometScalarFunction("replace")`.
   - `string_replace.sql` removes the `query ignore(#3344)` directive on the 
empty-literal case, adds rows for empty-source / NULL-source / 
NULL-replacement, and adds a row `('hello', '', 'x')` to the test table to 
exercise the column-valued empty-search case.
   
   ## Are these changes tested?
   
   Yes. `CometSqlFileTestSuite` runs the updated `string_replace.sql` and 
passes on Spark 3.5. Compile verified for Spark 3.4 (Java 11), Spark 3.5 (Java 
11), and Spark 4.0 (Java 17).
   
   ## Are there any user-facing changes?
   
   The default behavior of `replace` becomes Spark-compatible for all 
empty-search cases. Plans that previously executed `replace` natively now route 
through the JVM codegen dispatcher when 
`spark.comet.exec.scalaUDF.codegen.enabled=true` (the default), which is 
slightly slower than the native path but matches Spark exactly. The native path 
remains available behind 
`spark.comet.expression.StringReplace.allowIncompatible=true`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] fix: route StringReplace through codegen dispatcher to fix empty-search divergence [datafusion-comet]

Reply via email to