Re: [PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

via GitHub Tue, 02 Jun 2026 18:07:37 -0700


schenksj commented on code in PR #4524:
URL: https://github.com/apache/datafusion-comet/pull/4524#discussion_r3345345852



##########
native/shuffle/src/spark_unsafe/unsafe_object.rs:
##########
@@ -19,10 +19,125 @@ use super::list::SparkUnsafeArray;
 use super::map::SparkUnsafeMap;
 use super::row::SparkUnsafeRow;
 use datafusion_comet_common::bytes_to_i128;
-use std::str::from_utf8;
+use std::borrow::Cow;
 
 const MAX_LONG_DIGITS: u8 = 18;
 
+/// Decode `bytes` as UTF-8 the way Spark renders `StringType` -- `new 
String(bytes, UTF_8)` on the
+/// JVM -- replacing each ill-formed sequence with a single `U+FFFD` and 
skipping the same number of
+/// bytes the JDK's UTF-8 `CharsetDecoder` (action REPLACE) would. Valid UTF-8 
is returned as a
+/// zero-cost borrow.
+///
+/// This intentionally differs from `str::from_utf8_lossy` for surrogate-range 
three-byte sequences
+/// (`ED A0..BF ..`, e.g. CESU-8 / Java modified-UTF-8 supplementary chars) 
and for some other
+/// ill-formed multi-byte units: `from_utf8_lossy` follows the Unicode 
"maximal subpart" rule and
+/// can emit one `U+FFFD` per byte, whereas the JDK collapses certain 
ill-formed units into a single
+/// `U+FFFD`. Matching the JDK byte-for-byte means a Comet columnar shuffle of 
arbitrary bytes
+/// renders identically to a Spark JVM shuffle. The per-class malformed 
lengths below mirror
+/// `sun.nio.cs.UTF_8.Decoder` (E0/ED overlong & surrogate handling, F0/F4 
range checks).
+pub(crate) fn decode_utf8_spark_lossy(bytes: &[u8]) -> Cow<'_, str> {

Review Comment:
     Thanks @andygrove, happy to keep the provenance clear.
     
     I can confirm the decoder was written from observed behavior, not from the 
OpenJDK source. The per-class malformed lengths came from running `new 
String(bytes, UTF_8)` on a JDK 17 JVM and matching its
     output byte for byte, which is what the oracle test pins. I did not read 
`sun.nio.cs.UTF_8.Decoder` or any other OpenJDK source while writing it.
     
     I also reworded the comment in `ed1dd5d9f` so it describes the observable 
replacement behavior of the JDK UTF-8 decoder rather than implying the lengths 
mirror that class. It now states they were
     determined from observed output, not by reviewing the OpenJDK source.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

Reply via email to