andygrove opened a new issue, #4488:
URL: https://github.com/apache/datafusion-comet/issues/4488
## Describe the bug
`CAST(<binary> AS STRING)` in Comet runs through `cast_binary_formatter` in
`native/spark-expr/src/conversion_funcs/cast.rs`, which uses `unsafe {
String::from_utf8_unchecked(value.to_vec()) }` to convert non-UTF8 bytes into a
`String`. This is undefined behaviour in Rust (`String` is a documented
invariant: its bytes must be valid UTF-8), and any downstream code that relies
on that invariant (e.g. iterating over chars, slicing on char boundaries) can
misbehave for inputs that are not valid UTF-8.
Spark's `UTF8String.fromBytes` does not validate either, but it stores the
bytes in a non-`String`-typed buffer, so it does not violate any Java-level
invariant. The Comet path is the dangerous case.
Surfaced by the cast audit (collection PR queue). Today's
`CometCast.isSupported((BinaryType, StringType), ...)` returns
`Compatible(None)` so this path runs by default for any binary column whose
contents are not strictly valid UTF-8.
## Steps to reproduce
```sql
SELECT CAST(X'FF' AS STRING);
```
The result is byte-for-byte what Spark produces today (a one-byte string
holding `0xFF`), but the path through `from_utf8_unchecked` is UB and is
therefore not guaranteed to keep producing that result under future Rust
compiler / Arrow versions.
## Expected behavior
Replace the `from_utf8_unchecked` call with a safe equivalent. Options:
1. Reinterpret the underlying buffer of `BinaryArray` as a `StringArray`
without copying (Arrow stores both as the same byte layout): construct the
`StringArray` directly from the buffers without going through
`String::from_utf8_unchecked` on a freshly allocated `Vec<u8>`.
2. If a `String` is genuinely needed, use
`String::from_utf8_lossy(value).into_owned()` and accept the U+FFFD replacement
on invalid sequences (this diverges from Spark for invalid bytes but is safe).
Option 1 matches Spark exactly without UB.
## Additional context
- Native impl:
`native/spark-expr/src/conversion_funcs/cast.rs::cast_binary_formatter` (around
line 865)
- Comet matrix: `CometCast.canCastFromBinary` and `canCastToString` in
`spark/src/main/scala/org/apache/comet/expressions/CometCast.scala`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]