schenksj opened a new pull request, #4524: URL: https://github.com/apache/datafusion-comet/pull/4524
## Which issue does this PR close?

Closes #4521.

## Rationale for this change

Spark's `UnsafeRow.getUTF8String` performs no UTF-8 validation, and `cast(BinaryType -> StringType)` is a zero-copy reinterpret, so a `StringType` column can legitimately hold arbitrary, non-UTF-8 bytes. Comet's native shuffle `get_string` decoded with `from_utf8(..).unwrap()`, which **panics** on such rows even though Spark itself treats them as opaque.

## What changes are included in this PR?

`get_string` now decodes with `from_utf8_lossy` and returns `Cow<str>`: a zero-copy borrow for valid UTF-8 (the bytes are still validated, but nothing is allocated), and an owned `String` with `U+FFFD` replacements for invalid bytes. This is defined behavior, with no UB.

This deliberately avoids `from_utf8_unchecked`, which constructs a `&str` from arbitrary bytes (UB per the Rust reference) and would propagate into downstream Arrow ops that internally call `str::from_utf8_unchecked` on the buffer.

Callers pass the result through `append_value` (`impl AsRef<str>`), which `Cow<str>` satisfies.

## How are these changes tested?

Added a standalone unit test `get_string_tolerates_non_utf8_bytes` that builds a Spark `UnsafeRow` with a `StringType` field holding `0xFF 0xFE 'A'`. It **panics without the fix** (`Result::unwrap()` on a `Utf8Error`) and **passes with it** (lossy-decoded to `U+FFFD U+FFFD A`).
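For reviewers who want to see the decoding behavior in isolation, here is a minimal standalone sketch using only the Rust standard library. It is not the Comet code: `decode_lossy` and `append_len` are hypothetical stand-ins for `get_string` and for a builder's `append_value`, and the input is a plain byte slice rather than a real `UnsafeRow`.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for `get_string`: lossy-decodes raw bytes.
/// Valid UTF-8 is borrowed; invalid sequences become U+FFFD in an owned String.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical stand-in for a builder's `append_value(impl AsRef<str>)`.
/// It only reports the byte length of the value it was handed.
fn append_len(value: impl AsRef<str>) -> usize {
    value.as_ref().len()
}

fn main() {
    // Valid UTF-8: no allocation, the result borrows the input bytes.
    let valid = decode_lossy(b"hello");
    assert!(matches!(valid, Cow::Borrowed("hello")));

    // The bytes from the PR's unit test: 0xFF 0xFE 'A'.
    let raw = [0xFFu8, 0xFE, b'A'];

    // The old strict path returns Err here, so `.unwrap()` would panic.
    assert!(std::str::from_utf8(&raw).is_err());

    // The lossy path yields an owned string with two replacement characters.
    let invalid = decode_lossy(&raw);
    assert!(matches!(invalid, Cow::Owned(_)));
    assert_eq!(invalid.as_ref(), "\u{FFFD}\u{FFFD}A");

    // `Cow<str>` implements `AsRef<str>`, so it can be passed straight through.
    // Each U+FFFD is 3 bytes in UTF-8, plus 1 byte for 'A'.
    assert_eq!(append_len(invalid), 7);
}
```

Returning `Cow<str>` keeps the common case (valid UTF-8) allocation-free and pays for a copy only when replacement is actually needed.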
