andygrove opened a new issue, #4488:
URL: https://github.com/apache/datafusion-comet/issues/4488

   ## Describe the bug
   
   `CAST(<binary> AS STRING)` in Comet runs through `cast_binary_formatter` in 
`native/spark-expr/src/conversion_funcs/cast.rs`, which uses `unsafe { 
String::from_utf8_unchecked(value.to_vec()) }` to convert non-UTF8 bytes into a 
`String`. This is undefined behaviour in Rust (`String` is a documented 
invariant: its bytes must be valid UTF-8), and any downstream code that relies 
on that invariant (e.g. iterating over chars, slicing on char boundaries) can 
misbehave for inputs that are not valid UTF-8.
   
   Spark's `UTF8String.fromBytes` does not validate either, but it stores the 
bytes in a non-`String`-typed buffer, so it does not violate any Java-level 
invariant. The Comet path is the dangerous case.
   
   Surfaced by the cast audit (collection PR queue). Today's 
`CometCast.isSupported((BinaryType, StringType), ...)` returns 
`Compatible(None)` so this path runs by default for any binary column whose 
contents are not strictly valid UTF-8.
   
   ## Steps to reproduce
   
   ```sql
   SELECT CAST(X'FF' AS STRING);
   ```
   
   The result is byte-for-byte what Spark produces today (a one-byte string 
holding `0xFF`), but the path through `from_utf8_unchecked` is UB and is 
therefore not guaranteed to keep producing that result under future Rust 
compiler / Arrow versions.
   
   ## Expected behavior
   
   Replace the `from_utf8_unchecked` call with a safe equivalent. Options:
   
   1. Reinterpret the underlying buffer of `BinaryArray` as a `StringArray` 
without copying (Arrow stores both as the same byte layout): construct the 
`StringArray` directly from the buffers without going through 
`String::from_utf8_unchecked` on a freshly allocated `Vec<u8>`.
   2. If a `String` is genuinely needed, use 
`String::from_utf8_lossy(value).into_owned()` and accept the U+FFFD replacement 
on invalid sequences (this diverges from Spark for invalid bytes but is safe).
   
   Option 1 matches Spark exactly without UB.
   
   ## Additional context
   
   - Native impl: 
`native/spark-expr/src/conversion_funcs/cast.rs::cast_binary_formatter` (around 
line 865)
   - Comet matrix: `CometCast.canCastFromBinary` and `canCastToString` in 
`spark/src/main/scala/org/apache/comet/expressions/CometCast.scala`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to