[PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

via GitHub Fri, 29 May 2026 17:15:02 -0700


schenksj opened a new pull request, #4524:
URL: https://github.com/apache/datafusion-comet/pull/4524


   ## Which issue does this PR close?
   
   Closes #4521.
   
   ## Rationale for this change
   
   Spark's `UnsafeRow.getUTF8String` performs no UTF-8 validation, and 
`cast(BinaryType -> StringType)` is a zero-copy reinterpret — so a `StringType` 
column can legitimately hold arbitrary, non-UTF-8 bytes. Comet's native shuffle 
`get_string` decoded with `from_utf8(..).unwrap()`, which **panics** on such 
rows even though Spark itself treats them as opaque.
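
The failure mode can be reproduced with the standard library alone. The sketch below is illustrative, not Comet's actual code: `strict_decode` is a hypothetical stand-in for the strict decode step, fed the same three bytes used in the PR's unit test.

```rust
use std::str;

/// Hypothetical stand-in for a strict decode step (not Comet's real function).
/// Returns an error instead of unwrapping so the failure can be inspected.
fn strict_decode(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(bytes)
}

fn main() {
    // 0xFF and 0xFE can never appear in well-formed UTF-8.
    let bytes = [0xFF, 0xFE, b'A'];
    match strict_decode(&bytes) {
        Ok(s) => println!("valid: {s}"),
        // Calling `.unwrap()` on this Err is what turns bad input into a panic.
        Err(e) => println!("invalid UTF-8 after {} valid bytes", e.valid_up_to()),
    }
}
```

Calling `.unwrap()` on the `Err` branch is what converts a data-dependent condition into a process-level panic.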
   
   ## What changes are included in this PR?
   
   `get_string` now decodes with `String::from_utf8_lossy` and returns 
`Cow<str>`: a borrow with no allocation for valid UTF-8, and an owned `String` 
with `U+FFFD` replacements for invalid bytes. This is defined behavior, with no 
undefined behavior (UB) involved. It deliberately avoids `from_utf8_unchecked`, 
which constructs a `&str` from arbitrary bytes (UB per the Rust reference) and 
would propagate into downstream Arrow ops that internally call 
`str::from_utf8_unchecked` on the buffer. Callers pass the result through 
`append_value` (`impl AsRef<str>`), which `Cow<str>` satisfies.
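
A minimal sketch of the lossy-decode behavior described above, using only the standard library. The names `decode_lossy` and the free function `append_value` are hypothetical stand-ins (the real `get_string` reads from an `UnsafeRow`, and the real `append_value` is a builder method); only `String::from_utf8_lossy` and `Cow` are actual APIs here.

```rust
use std::borrow::Cow;

/// Hypothetical simplified decode: borrows when the bytes are valid UTF-8,
/// allocates a repaired String otherwise.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical stand-in for a builder's `append_value(impl AsRef<str>)`.
fn append_value(out: &mut Vec<String>, value: impl AsRef<str>) {
    out.push(value.as_ref().to_owned());
}

fn main() {
    let mut out = Vec::new();

    // Valid input: no allocation, the Cow borrows the original bytes.
    let valid = decode_lossy(b"abc");
    assert!(matches!(valid, Cow::Borrowed(_)));
    append_value(&mut out, valid);

    // Invalid input: each bad byte becomes U+FFFD in an owned String.
    let invalid = decode_lossy(&[0xFF, 0xFE, b'A']);
    assert!(matches!(invalid, Cow::Owned(_)));
    append_value(&mut out, invalid);

    assert_eq!(out, ["abc", "\u{FFFD}\u{FFFD}A"]);
}
```

Because `Cow<str>` implements `AsRef<str>`, both variants flow through the same call site without the caller needing to distinguish them.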
   
   ## How are these changes tested?
   
   Added a standalone unit test `get_string_tolerates_non_utf8_bytes` that 
builds a Spark `UnsafeRow` with a `StringType` field holding `0xFF 0xFE 'A'`. 
It **panics without the fix** (`Result::unwrap()` on a `Utf8Error`) and 
**passes with it** (lossy-decoded to `U+FFFD U+FFFD A`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

