schenksj commented on PR #4524:
URL: https://github.com/apache/datafusion-comet/pull/4524#issuecomment-4587515871

   Thanks @andygrove — and for running the Rust-vs-JDK decoder comparison; that 
surrogate-path divergence was exactly the kind of silent answer-mismatch worth 
closing off, so I went for exact parity rather than a noted limitation.
   
   `get_string` now decodes via a new `decode_utf8_spark_lossy` that mirrors 
`sun.nio.cs.UTF_8.Decoder` (action REPLACE) byte-for-byte. It uses the same 
malformed-sequence length for each lead-byte class as the JDK (E0/ED overlong 
and surrogate handling, F0/F4 range checks), so the `U+FFFD` granularity 
matches `new String(bytes, UTF_8)`. Valid UTF-8 still takes the fast path and 
is borrowed without copying.
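
   For anyone skimming the thread, here is a minimal sketch of the shape 
described above. It is not the code in this PR: `decode_lossy_sketch` and 
`is_encoded_surrogate` are hypothetical names, and only two behaviors are 
modeled, the zero-copy fast path and the single `U+FFFD` for an encoded 
surrogate (assuming the whole `ED A0..BF 80..BF` range behaves like 
`ED A0 80`). Every other malformed sequence falls back to the length reported 
by Rust's `Utf8Error`, which is a simplification and may differ from the JDK.

```rust
use std::borrow::Cow;

/// Hypothetical sketch, not the PR's `decode_utf8_spark_lossy`.
fn decode_lossy_sketch(bytes: &[u8]) -> Cow<'_, str> {
    // Fast path: valid UTF-8 is borrowed, with no allocation or copy.
    if let Ok(s) = std::str::from_utf8(bytes) {
        return Cow::Borrowed(s);
    }

    // Slow path: copy valid runs and emit one U+FFFD per malformed unit.
    let mut out = String::with_capacity(bytes.len());
    let mut rest = bytes;
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.push_str(s);
                break;
            }
            Err(e) => {
                let (valid, bad) = rest.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                let skip = if is_encoded_surrogate(bad) {
                    // Consume all three bytes for a single replacement char.
                    3
                } else {
                    // Simplification: Rust's own malformed length, which is
                    // NOT guaranteed to match the JDK's per-class lengths.
                    e.error_len().unwrap_or(bad.len())
                };
                rest = &bad[skip..];
            }
        }
    }
    Cow::Owned(out)
}

/// True when `b` starts with a three-byte encoding of U+D800..=U+DFFF.
fn is_encoded_surrogate(b: &[u8]) -> bool {
    b.len() >= 3
        && b[0] == 0xED
        && (0xA0..=0xBF).contains(&b[1])
        && (0x80..=0xBF).contains(&b[2])
}
```

   Calling `decode_lossy_sketch(&[0xED, 0xA0, 0x80])` yields one `U+FFFD`, 
while `String::from_utf8_lossy` on the same bytes yields three.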
   
   I turned your comparison table into the test oracle: 
`matches_jvm_replacement_granularity` asserts all 7 of your rows, and the 
surrogate case `ED A0 80` now decodes to a single `U+FFFD` (was three under 
`from_utf8_lossy`), matching Spark. Verified against JDK 17.
   
   Agreed the failing `Spark 3.5 [expressions]` check is the Maven-download 403 
infra flake, not this PR — a re-run should clear it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

