[PR] fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags [datafusion-comet]

via GitHub Fri, 12 Jun 2026 06:41:04 -0700


andygrove opened a new pull request, #4639:
URL: https://github.com/apache/datafusion-comet/pull/4639


   ## Which issue does this PR close?
   
   Closes #4465.
   
   ## Rationale for this change
   
   Comet currently lowers `decode(bin, charset)` to a TRY-mode binary→string 
cast, which produces the wrong result for invalid byte sequences in every 
supported Spark version:
   
   | Spark version / mode | Spark behavior | Comet behavior (pre-fix) |
   | --- | --- | --- |
   | 3.x | Substitutes Unicode replacement char | NULL / undefined |
   | 4.0, `legacyErrorAction = true` | Same as 3.x | NULL / undefined |
   | 4.0 default (`legacyErrorAction = false`) | Raises 
`MALFORMED_CHARACTER_CODING` | NULL / undefined |
   
   The Spark 4.0 shim destructured the `StaticInvoke` arguments and discarded 
both `legacyCharsets` / `legacyErrorAction` flags, so neither the substitute 
nor the strict-error path was reachable.
   
   ## What changes are included in this PR?
   
   - Spark 4.0+ shim (`Spark4xCometExprShim`) now routes the 
`StringDecode.decode` `StaticInvoke` through 
`CometScalaUDF.emitJvmCodegenDispatch` instead of the buggy TRY-cast helper. 
The dispatcher runs Spark's own `doGenCode` inside the Comet pipeline, so the 
captured `legacyCharsets` / `legacyErrorAction` literals are honored exactly.
   - Spark 3.x `CometStringDecode` collapses to `extends 
CometCodegenDispatch[StringDecode]`, so the 3.x replacement-character behavior 
also matches Spark exactly. The per-expression 
`spark.comet.expression.StringDecode.enabled` flag still works because the 
framework checks it before calling `convert`.
   - Removes the now-unused `CommonStringExprs.stringDecode` helper and the 
`CommonStringExprs` mixin from the 3.x / 4.x shims.
   - Updates `decode.sql` to drop obsolete `expect_fallback` lines (the 
dispatcher handles non-UTF-8 charsets too) and adds two new fixtures:
     - `decode_invalid_utf8.sql` — replacement-character behavior on Spark 3.x 
default and Spark 4.0+ with both legacy flags enabled.
     - `decode_invalid_utf8_strict.sql` — Spark 4.0+ default strict mode with 
`expect_error(MALFORMED_CHARACTER_CODING)`, including a sentinel valid-input 
query so the assertion does not pass vacuously through an operator-level 
fallback.
   
   ## How are these changes tested?
   
   - `mvn compile` clean on Spark 3.4 / 3.5 / 4.0 / 4.1 (spotless + scalastyle 
clean).
   - `CometSqlFileTestSuite decode` passes on Spark 3.4, 3.5, and 4.0; the 
strict-mode fixture is skipped on 3.x via `MinSparkVersion: 4.0` and runs on 
4.0.
   - Existing `CometStringDecodeSuite` (Spark 3.x only) still passes, 
confirming the per-expression enable flag continues to work after the refactor.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags [datafusion-comet]

Reply via email to