schenksj commented on PR #4524: URL: https://github.com/apache/datafusion-comet/pull/4524#issuecomment-4587515871
Thanks @andygrove, and thanks for running the Rust-vs-JDK decoder comparison. That surrogate-path divergence was exactly the kind of silent answer mismatch worth closing off, so I went for exact parity rather than a documented limitation.

`get_string` now decodes via a new `decode_utf8_spark_lossy` that mirrors `sun.nio.cs.UTF_8.Decoder` (action REPLACE) byte-for-byte. It uses the same per-class malformed lengths as the JDK (E0/ED overlong and surrogate handling, F0/F4 range checks), so the `U+FFFD` granularity matches `new String(bytes, UTF_8)`. Valid UTF-8 still takes the fast path and borrows zero-copy.

I turned your comparison table into the test oracle: `matches_jvm_replacement_granularity` asserts all 7 of your rows, and the surrogate case `ED A0 80` now decodes to a single `U+FFFD` (was three under `from_utf8_lossy`), matching Spark. Verified against JDK 17.

Agreed that the failing `Spark 3.5 [expressions]` check is the Maven-download 403 infra flake, not this PR; a re-run should clear it.
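
For readers following along, here is a minimal sketch of the two ideas described above: a zero-copy fast path for valid UTF-8, and collapsing an encoded surrogate such as `ED A0 80` into a single `U+FFFD`. This is not the PR's `decode_utf8_spark_lossy`. The names `decode_fast_path` and `lossy_surrogate_as_one` are hypothetical, and the slow path covers only the surrogate row, not the E0 overlong or F0/F4 range cases; everything else is delegated to `String::from_utf8_lossy`.

```rust
use std::borrow::Cow;

/// Hypothetical slow path: treats a complete three-byte surrogate encoding
/// (ED A0..BF 80..BF) as ONE malformed unit, as the comment above says the
/// JDK does. All other malformed input falls through to `from_utf8_lossy`,
/// so this is NOT a full mirror of the JDK decoder.
fn lossy_surrogate_as_one(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    let mut start = 0; // start of the pending segment not yet decoded
    let mut i = 0;
    while i < bytes.len() {
        let is_surrogate = bytes[i] == 0xED
            && i + 2 < bytes.len()
            && (0xA0..=0xBF).contains(&bytes[i + 1])
            && (0x80..=0xBF).contains(&bytes[i + 2]);
        if is_surrogate {
            // Flush everything before the surrogate, then emit one U+FFFD.
            out.push_str(&String::from_utf8_lossy(&bytes[start..i]));
            out.push('\u{FFFD}');
            i += 3;
            start = i;
        } else {
            i += 1;
        }
    }
    out.push_str(&String::from_utf8_lossy(&bytes[start..]));
    out
}

/// Hypothetical entry point: borrow when the input is already valid UTF-8,
/// allocate only when a replacement is actually needed.
fn decode_fast_path(bytes: &[u8]) -> Cow<'_, str> {
    match std::str::from_utf8(bytes) {
        Ok(s) => Cow::Borrowed(s),
        Err(_) => Cow::Owned(lossy_surrogate_as_one(bytes)),
    }
}
```

With this sketch, `decode_fast_path(&[0xED, 0xA0, 0x80])` yields one `U+FFFD`, whereas `String::from_utf8_lossy` on the same bytes yields three, which is the divergence the comparison table surfaced.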
