schenksj opened a new issue, #4521: URL: https://github.com/apache/datafusion-comet/issues/4521
### Background (standalone)

Spark's `UnsafeRow.getUTF8String` wraps bytes with no UTF-8 validation, and `cast(BinaryType -> StringType)` is a zero-copy reinterpret, so a `StringType` column can legitimately hold arbitrary, non-UTF-8 bytes. Comet's native shuffle `get_string` decoded with `from_utf8(...).unwrap()`, which **panics** on such rows even though Spark itself treats them as opaque.

### Proposal

Decode with `from_utf8_lossy`: it returns the original borrow for valid UTF-8 (no allocation or copy) and an owned `String` with `U+FFFD` replacements for invalid bytes. That is defined behavior, with no undefined behavior (UB). This avoids `from_utf8_unchecked`, which would construct a `&str` from arbitrary bytes (UB per the Rust reference, and it would propagate into downstream Arrow ops that internally call `str::from_utf8_unchecked` on the buffer).

The change makes `get_string` return `Cow<str>`; callers use it via `AsRef<str>`.

### Relationship to the Delta integration

Standalone: any `binary -> string` cast routed through shuffle hits this, and no Delta knowledge is needed to review.

**Required by the in-progress `contrib/delta` integration**: Delta Z-Order `OPTIMIZE` emits `interleave_bits(...).cast(StringType)`, which produces non-UTF-8 bytes and panics shuffle today. Please prioritize accordingly.

### Tests

- Shuffle round-trip of a `StringType` column containing invalid UTF-8 bytes (e.g. from a `binary -> string` cast) returns without panicking and matches Spark's lenient semantics.
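
### Illustrative sketch

To make the proposed decoding concrete, here is a minimal sketch of the lossy approach using only the Rust standard library. It is not the actual Comet code: the `get_string` signature below (taking a plain byte slice) and the `char_count` caller are hypothetical stand-ins chosen for illustration.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the shuffle row accessor: decodes the raw bytes
/// of a string field without assuming they are valid UTF-8.
fn get_string(bytes: &[u8]) -> Cow<'_, str> {
    // Valid UTF-8 comes back as Cow::Borrowed (validated, but no allocation).
    // Invalid sequences are replaced with U+FFFD in a newly allocated String,
    // so this never panics and never creates an invalid &str.
    String::from_utf8_lossy(bytes)
}

/// Hypothetical caller that only needs a &str view, so it accepts any
/// AsRef<str>, including the Cow<str> returned above.
fn char_count<S: AsRef<str>>(s: S) -> usize {
    s.as_ref().chars().count()
}
```

A caller could pass the result straight through, for example `char_count(get_string(&[b'a', 0xFF, b'b']))`, where the lone `0xFF` byte is decoded as a single `U+FFFD` instead of triggering a panic. The trade-off of this design is that the original invalid bytes are not preserved in the decoded string.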
