andygrove commented on code in PR #4524:
URL: https://github.com/apache/datafusion-comet/pull/4524#discussion_r3343799845
##########
native/shuffle/src/spark_unsafe/unsafe_object.rs:
##########
@@ -19,10 +19,125 @@ use super::list::SparkUnsafeArray;
use super::map::SparkUnsafeMap;
use super::row::SparkUnsafeRow;
use datafusion_comet_common::bytes_to_i128;
-use std::str::from_utf8;
+use std::borrow::Cow;
const MAX_LONG_DIGITS: u8 = 18;
+/// Decode `bytes` as UTF-8 the way Spark renders `StringType` -- `new
String(bytes, UTF_8)` on the
+/// JVM -- replacing each ill-formed sequence with a single `U+FFFD` and
skipping the same number of
+/// bytes the JDK's UTF-8 `CharsetDecoder` (action REPLACE) would. Valid UTF-8
is returned as a
+/// zero-cost borrow.
+///
+/// This intentionally differs from `str::from_utf8_lossy` for surrogate-range
three-byte sequences
+/// (`ED A0..BF ..`, e.g. CESU-8 / Java modified-UTF-8 supplementary chars)
and for some other
+/// ill-formed multi-byte units: `from_utf8_lossy` follows the Unicode
"maximal subpart" rule and
+/// can emit one `U+FFFD` per byte, whereas the JDK collapses certain
ill-formed units into a single
+/// `U+FFFD`. Matching the JDK byte-for-byte means a Comet columnar shuffle of
arbitrary bytes
+/// renders identically to a Spark JVM shuffle. The per-class malformed
lengths below mirror
+/// `sun.nio.cs.UTF_8.Decoder` (E0/ED overlong & surrogate handling, F0/F4
range checks).
+pub(crate) fn decode_utf8_spark_lossy(bytes: &[u8]) -> Cow<'_, str> {
Review Comment:
The decoder looks like an independent, spec-derived implementation (the
structure is idiomatic Rust and bears no resemblance to the JDK's NIO
CoderResult decoder), and the oracle test confirms you derived the behavior
from observed JVM output rather than the source. Just to keep provenance clean
for ASF: can you confirm this was written from the Unicode/RFC 3629 spec plus
observed new String(bytes, UTF_8) outputs, not adapted from the OpenJDK source?
It might also be worth softening the two "mirror sun.nio.cs.UTF_8.Decoder"
comments to something like "matches the observable replacement behavior of the
JDK UTF-8 decoder," so the comments describe behavior rather than implying the
code derives from that class.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]