andygrove opened a new pull request, #4639:
URL: https://github.com/apache/datafusion-comet/pull/4639
## Which issue does this PR close?
Closes #4465.
## Rationale for this change
Comet currently lowers `decode(bin, charset)` to a TRY-mode binary→string
cast, which produces the wrong result for invalid byte sequences in every
supported Spark version:
| Spark version / mode | Spark behavior | Comet behavior (pre-fix) |
| --- | --- | --- |
| 3.x | Substitutes Unicode replacement char | NULL / undefined |
| 4.0, `legacyErrorAction = true` | Same as 3.x | NULL / undefined |
| 4.0 default (`legacyErrorAction = false`) | Raises
`MALFORMED_CHARACTER_CODING` | NULL / undefined |
The Spark 4.0 shim destructured the `StaticInvoke` arguments and discarded
both `legacyCharsets` / `legacyErrorAction` flags, so neither the substitute
nor the strict-error path was reachable.
## What changes are included in this PR?
- Spark 4.0+ shim (`Spark4xCometExprShim`) now routes the
`StringDecode.decode` `StaticInvoke` through
`CometScalaUDF.emitJvmCodegenDispatch` instead of the buggy TRY-cast helper.
The dispatcher runs Spark's own `doGenCode` inside the Comet pipeline, so the
captured `legacyCharsets` / `legacyErrorAction` literals are honored exactly.
- Spark 3.x `CometStringDecode` collapses to `extends
CometCodegenDispatch[StringDecode]`, so the 3.x replacement-character behavior
also matches Spark exactly. The per-expression
`spark.comet.expression.StringDecode.enabled` flag still works because the
framework checks it before calling `convert`.
- Removes the now-unused `CommonStringExprs.stringDecode` helper and the
`CommonStringExprs` mixin from the 3.x / 4.x shims.
- Updates `decode.sql` to drop obsolete `expect_fallback` lines (the
dispatcher handles non-UTF-8 charsets too) and adds two new fixtures:
- `decode_invalid_utf8.sql` — replacement-character behavior on Spark 3.x
default and Spark 4.0+ with both legacy flags enabled.
- `decode_invalid_utf8_strict.sql` — Spark 4.0+ default strict mode with
`expect_error(MALFORMED_CHARACTER_CODING)`, including a sentinel valid-input
query so the assertion does not pass vacuously through an operator-level
fallback.
## How are these changes tested?
- `mvn compile` clean on Spark 3.4 / 3.5 / 4.0 / 4.1 (spotless + scalastyle
clean).
- `CometSqlFileTestSuite decode` passes on Spark 3.4, 3.5, and 4.0; the
strict-mode fixture is skipped on 3.x via `MinSparkVersion: 4.0` and runs on
4.0.
- Existing `CometStringDecodeSuite` (Spark 3.x only) still passes,
confirming the per-expression enable flag continues to work after the refactor.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]