andygrove opened a new pull request, #4763:
URL: https://github.com/apache/datafusion-comet/pull/4763
## Which issue does this PR close?
Closes #4488.
## Rationale for this change
Native `CAST(<binary> AS string)` ran through `cast_binary_formatter`, which
used `unsafe { String::from_utf8_unchecked(value.to_vec()) }` to turn non-UTF-8
bytes into a Rust `String`. `String` carries a documented invariant that its
bytes are valid UTF-8, so constructing one from arbitrary bytes is undefined
behaviour, and any downstream code that relies on that invariant can misbehave
for non-UTF-8 input. The cast was reported as `Compatible`, so this path ran by
default for any binary column whose contents are not strictly valid UTF-8.
There is an inherent tension here: Spark's `UTF8String` tolerates arbitrary
bytes, while Arrow's `Utf8` type nominally requires valid UTF-8. Comet cannot
be both byte-exact with Spark and guaranteed-valid-UTF-8 in the native
pipeline. This PR removes the undefined behaviour, keeps the byte-exact Spark
semantics on an explicit opt-in, and makes the default path safe by falling
back to Spark.
## What changes are included in this PR?
- **Native (`cast.rs`):** remove `cast_binary_formatter` and its
`from_utf8_unchecked` call. The plain `CAST(binary AS string)` path now
reinterprets the binary array's offset/value/null buffers as a string array via
`GenericStringArray::new_unchecked`, preserving Spark's exact bytes
(`UTF8String.fromBytes`) without ever materializing an invalid `String`. The
`ToPrettyString` formatting path (hex/base64/etc., always valid UTF-8) is
unchanged.
- **Serde (`CometCast.scala`):** `CAST(binary AS string)` is now reported as
`Incompatible` by default, so it falls back to Spark and no non-UTF-8 string
array enters the native pipeline. A dedicated opt-in,
`spark.comet.expression.Cast.binaryToString.allowIncompatible`, enables the
native byte-exact path on its own (the generic `Cast.allowIncompatible`
continues to enable it as well, consistent with the other incompatible casts).
- **Tests:** a Rust unit test asserts non-UTF-8 bytes are preserved exactly;
`CometCastSuite` gains support-level assertions (Incompatible by default /
Compatible with the opt-in) and a native byte-exact test under the opt-in, and
the canonical `cast BinaryType to StringType` entry is moved to `ignore` per
the suite's "all valid cast combinations covered" contract; the
`cast_array_to_string` and `cast_complex_types_to_string` SQL-file tests opt in
so their `array<binary>` / `struct<binary>` to-string cases keep exercising the
native path.
## How are these changes tested?
- Native: `cargo test -p datafusion-comet-spark-expr conversion_funcs::cast`
(new `test_cast_binary_to_string_preserves_non_utf8_bytes` plus the existing
cast tests), `cargo fmt`, and clippy are clean.
- JVM: full `CometCastSuite` (163 passed, 10 ignored) and all
`CometSqlFileTestSuite` cast files pass on Spark 4.1.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]