[I] native shuffle: get_string should not panic on non-UTF-8 bytes (use lossy decode) [datafusion-comet]

via GitHub Fri, 29 May 2026 16:56:35 -0700


schenksj opened a new issue, #4521:
URL: https://github.com/apache/datafusion-comet/issues/4521


   ### Background (standalone)
   
   Spark's `UnsafeRow.getUTF8String` wraps bytes with no UTF-8 validation, and 
`cast(BinaryType -> StringType)` is a zero-copy reinterpretation, so a 
`StringType` column can legitimately hold arbitrary, non-UTF-8 bytes. Comet's 
native shuffle `get_string` decodes with `from_utf8(...).unwrap()`, which 
**panics** on such rows even though Spark itself treats them as opaque.
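   
To illustrate the failure mode, here is a minimal sketch of strict decoding. The helper name `strict_decode` and the sample bytes are assumptions for illustration only and are not taken from Comet's code. It returns the `Result` instead of unwrapping, so the error that `unwrap()` would turn into a panic can be inspected.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical stand-in for the strict decode path (not Comet's actual code).
/// `str::from_utf8` validates the bytes and returns `Err` on the first
/// invalid sequence; calling `.unwrap()` on that `Err` is what panics.
fn strict_decode(bytes: &[u8]) -> Result<&str, Utf8Error> {
    str::from_utf8(bytes)
}

/// Sample payload: 'f', then 0xFF (never valid in UTF-8), then 'o'.
/// A `binary -> string` cast could plausibly produce bytes like these.
const INVALID: [u8; 3] = [0x66, 0xFF, 0x6F];
```

With these definitions, `strict_decode(b"abc")` yields `Ok("abc")`, while `strict_decode(&INVALID)` yields an `Err` whose `valid_up_to()` is 1, that is, the position of the first offending byte.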
   
   ### Proposal
   
   Decode with `String::from_utf8_lossy`: it returns the original borrow for 
valid UTF-8 (no allocation or copy) and an owned `String` with `U+FFFD` 
replacements for invalid bytes, which is defined behavior with no UB. This 
avoids `from_utf8_unchecked`, which would construct a `&str` from arbitrary 
bytes (UB per the Rust reference, and would propagate into downstream Arrow 
ops that internally call `str::from_utf8_unchecked` on the buffer). The change 
makes `get_string` return `Cow<str>`; callers use it via `AsRef<str>`.
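   
A minimal sketch of the proposed decode follows. The function `get_string_lossy` is a hypothetical simplification that takes the raw bytes directly (the real `get_string` presumably reads them from the row buffer, which is not shown in this issue), and `char_len` is an invented caller that accepts the result through `AsRef<str>`.

```rust
use std::borrow::Cow;

/// Hypothetical simplification of the proposed `get_string` (assumption:
/// the real function locates the bytes inside a row buffer first).
/// Valid UTF-8 comes back as `Cow::Borrowed` with no allocation; invalid
/// sequences are replaced with U+FFFD in a newly allocated `Cow::Owned`.
fn get_string_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Invented example caller: works with either `Cow` variant via `AsRef<str>`.
fn char_len(s: impl AsRef<str>) -> usize {
    s.as_ref().chars().count()
}
```

For example, `get_string_lossy(b"abc")` is `Cow::Borrowed("abc")`, while `get_string_lossy(&[0x66, 0xFF, 0x6F])` is an owned `"f\u{FFFD}o"`. The decoded value for invalid input no longer matches the original bytes, so this sketch shows only the no-panic property, not byte-for-byte preservation.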
   
   ### Relationship to the Delta integration
   
   This issue is standalone: any `binary -> string` cast routed through 
shuffle hits it, and no Delta knowledge is needed to review. It is also 
**required by the in-progress `contrib/delta` integration**: Delta Z-Order 
`OPTIMIZE` emits `interleave_bits(...).cast(StringType)`, which produces 
non-UTF-8 bytes and makes shuffle panic today. Please prioritize accordingly.
   
   ### Tests
   
   - Shuffle round-trip of a `StringType` column containing invalid UTF-8 bytes 
(e.g. from a `binary -> string` cast) returns without panicking and matches 
Spark's lenient semantics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


