andygrove opened a new pull request, #4763:
URL: https://github.com/apache/datafusion-comet/pull/4763

   ## Which issue does this PR close?
   
   Closes #4488.
   
   ## Rationale for this change
   
   Native `CAST(<binary> AS string)` ran through `cast_binary_formatter`, which 
used `unsafe { String::from_utf8_unchecked(value.to_vec()) }` to turn non-UTF-8 
bytes into a Rust `String`. `String` carries a documented invariant that its 
bytes are valid UTF-8, so constructing one from arbitrary bytes is undefined 
behaviour, and any downstream code that relies on that invariant can misbehave 
for non-UTF-8 input. The cast was reported as `Compatible`, so this path ran by 
default for any binary column whose contents are not strictly valid UTF-8.
   
   There is an inherent tension here: Spark's `UTF8String` tolerates arbitrary 
bytes, while Arrow's `Utf8` type nominally requires valid UTF-8. Comet cannot 
be both byte-exact with Spark and guaranteed-valid-UTF-8 in the native 
pipeline. This PR removes the undefined behaviour, keeps the byte-exact Spark 
semantics on an explicit opt-in, and makes the default path safe by falling 
back to Spark.
   
   ## What changes are included in this PR?
   
   - **Native (`cast.rs`):** remove `cast_binary_formatter` and its 
`from_utf8_unchecked` call. The plain `CAST(binary AS string)` path now 
reinterprets the binary array's offset/value/null buffers as a string array via 
`GenericStringArray::new_unchecked`, preserving Spark's exact bytes 
(`UTF8String.fromBytes`) without ever materializing an invalid `String`. The 
`ToPrettyString` formatting path (hex/base64/etc., always valid UTF-8) is 
unchanged.
   - **Serde (`CometCast.scala`):** `CAST(binary AS string)` is now reported as 
`Incompatible` by default, so it falls back to Spark and no non-UTF-8 string 
array enters the native pipeline. A dedicated opt-in, 
`spark.comet.expression.Cast.binaryToString.allowIncompatible`, enables the 
native byte-exact path on its own (the generic `Cast.allowIncompatible` 
continues to enable it as well, consistent with the other incompatible casts).
   - **Tests:** a Rust unit test asserts non-UTF-8 bytes are preserved exactly; 
`CometCastSuite` gains support-level assertions (Incompatible by default / 
Compatible with the opt-in) and a native byte-exact test under the opt-in, and 
the canonical `cast BinaryType to StringType` entry is moved to `ignore` per 
the suite's "all valid cast combinations covered" contract; the 
`cast_array_to_string` and `cast_complex_types_to_string` SQL-file tests opt in 
so their `array<binary>` / `struct<binary>` to-string cases keep exercising the 
native path.
   
   ## How are these changes tested?
   
   - Native: `cargo test -p datafusion-comet-spark-expr conversion_funcs::cast` 
(new `test_cast_binary_to_string_preserves_non_utf8_bytes` plus the existing 
cast tests), `cargo fmt`, and clippy are clean.
   - JVM: full `CometCastSuite` (163 passed, 10 ignored) and all 
`CometSqlFileTestSuite` cast files pass on Spark 4.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to