Re: [PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

via GitHub Sun, 31 May 2026 07:12:48 -0700


andygrove commented on PR #4524:
URL: 
https://github.com/apache/datafusion-comet/pull/4524#issuecomment-4586970171


   This is a clean fix and the reasoning is excellent. Avoiding 
`from_utf8_unchecked` because it is UB and would propagate into Arrow's 
internal `str::from_utf8_unchecked` is exactly right, and replacing a panic 
with defined behavior is a clear improvement. The `Cow` return keeps the 
valid-UTF-8 path a zero-cost borrow, and every `get_string` call site feeds the 
result into `append_value` (`impl AsRef<str>`), which `Cow<str>` satisfies, so 
nothing else needs to change.
   
   On CI: the one failing check (`Spark 3.5 [expressions]`) is pure 
infrastructure, not this PR. The job could not download Maven itself (`HTTP 403 
for .../apache-maven-3.9.6-bin.zip`). A re-run should clear it.
   
   One compatibility nuance worth raising. The description says the lossy 
decode matches how Spark renders the same bytes. That holds for single invalid 
bytes, which is what the tests cover, but it is not universally true. Spark 
renders `StringType` bytes on the driver via `UTF8String.toString()`, which is 
`new String(bytes, StandardCharsets.UTF_8)`, so Java's decoder is the 
reference, and Rust's `from_utf8_lossy` uses a different replacement 
granularity for some multi-byte sequences.
   
   I compared the two directly (Rust `from_utf8_lossy` vs JDK 17 `new 
String(bytes, UTF_8)`):
   
   | Bytes | Rust | Java (Spark) | Match |
   |---|---|---|---|
   | `FF FE 41` | `FFFD FFFD A` | `FFFD FFFD A` | yes |
   | `80 42` | `FFFD B` | `FFFD B` | yes |
   | `E0 80` | `FFFD FFFD` | `FFFD FFFD` | yes |
   | `F0 80 80 41` | `FFFD FFFD FFFD A` | `FFFD FFFD FFFD A` | yes |
   | `C0 AF` | `FFFD FFFD` | `FFFD FFFD` | yes |
   | `ED A0 80` (surrogate) | `FFFD FFFD FFFD` | `FFFD` | **no** |
   | `F4 90 80 80` | `FFFD FFFD FFFD FFFD` | `FFFD FFFD FFFD FFFD` | yes |
   
   So for surrogate-range sequences (`ED A0..BF ...`), Rust emits one `U+FFFD` 
per byte while Java collapses the whole ill-formed unit into a single `U+FFFD`. 
That pattern is reachable through the same `cast(BinaryType -> StringType)` 
path this issue is about, and it shows up in real data that went through CESU-8 
or Java modified-UTF-8 (surrogate-pair encoded supplementary characters), which 
some Hadoop and Hive paths produce. The practical effect is that a Comet 
columnar shuffle of such a column would produce `FFFD FFFD FFFD` while Spark's 
own shuffle produces `FFFD`, so the answers would differ.
   
   This is not a blocker. Not crashing is clearly better than the status quo. 
But the "matches Spark" claim has this exception, so it would be good to either 
note the limitation or, if exact parity matters, decode with the same 
maximal-subpart strategy Java uses.
   
   On tests: the unit test is a nice low-level reproducer, and the end-to-end 
test is thoughtfully done, especially disabling Comet's own `Cast` so the raw 
bytes actually reach the JVM shuffle inside an `UnsafeRow`. Since all the 
tested sequences are single invalid bytes (where Rust and Java agree), would it 
be worth adding a case like `ED A0 80`? With the current `from_utf8_lossy` that 
row would decode to three `U+FFFD` in Comet but one in Spark, so it would show 
whether the answers still line up on the surrogate path.
   
   *Disclosure: I used Claude Code to help review this PR, including running 
the Rust-vs-Java decoder comparison above.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(shuffle): tolerate non-UTF-8 bytes in get_string (lossy decode) [datafusion-comet]

Reply via email to