Re: [PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

via GitHub Tue, 02 Jun 2026 12:16:19 -0700


andygrove commented on code in PR #4524:
URL: https://github.com/apache/datafusion-comet/pull/4524#discussion_r3343799845



##########
native/shuffle/src/spark_unsafe/unsafe_object.rs:
##########
@@ -19,10 +19,125 @@ use super::list::SparkUnsafeArray;
 use super::map::SparkUnsafeMap;
 use super::row::SparkUnsafeRow;
 use datafusion_comet_common::bytes_to_i128;
-use std::str::from_utf8;
+use std::borrow::Cow;
 
 const MAX_LONG_DIGITS: u8 = 18;
 
+/// Decode `bytes` as UTF-8 the way Spark renders `StringType` -- `new 
String(bytes, UTF_8)` on the
+/// JVM -- replacing each ill-formed sequence with a single `U+FFFD` and 
skipping the same number of
+/// bytes the JDK's UTF-8 `CharsetDecoder` (action REPLACE) would. Valid UTF-8 
is returned as a
+/// zero-cost borrow.
+///
+/// This intentionally differs from `str::from_utf8_lossy` for surrogate-range 
three-byte sequences
+/// (`ED A0..BF ..`, e.g. CESU-8 / Java modified-UTF-8 supplementary chars) 
and for some other
+/// ill-formed multi-byte units: `from_utf8_lossy` follows the Unicode 
"maximal subpart" rule and
+/// can emit one `U+FFFD` per byte, whereas the JDK collapses certain 
ill-formed units into a single
+/// `U+FFFD`. Matching the JDK byte-for-byte means a Comet columnar shuffle of 
arbitrary bytes
+/// renders identically to a Spark JVM shuffle. The per-class malformed 
lengths below mirror
+/// `sun.nio.cs.UTF_8.Decoder` (E0/ED overlong & surrogate handling, F0/F4 
range checks).
+pub(crate) fn decode_utf8_spark_lossy(bytes: &[u8]) -> Cow<'_, str> {

Review Comment:
   The decoder looks like an independent, spec-derived implementation (the 
structure is idiomatic Rust and bears no resemblance to the JDK's NIO 
CoderResult decoder), and the oracle test confirms you derived the behavior 
from observed JVM output rather than the source. Just to keep provenance clean 
for ASF: can you confirm this was written from the Unicode/RFC 3629 spec plus 
observed new String(bytes, UTF_8) outputs, not adapted from the OpenJDK source? 
It might also be worth softening the two "mirror sun.nio.cs.UTF_8.Decoder" 
comments to something like "matches the observable replacement behavior of the 
JDK UTF-8 decoder," so the comments describe behavior rather than implying the 
code derives from that class.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

Reply via email to