andygrove commented on PR #4524: URL: https://github.com/apache/datafusion-comet/pull/4524#issuecomment-4586970171
This is a clean fix and the reasoning is excellent. Avoiding `from_utf8_unchecked` because it is UB and would propagate into Arrow's internal `str::from_utf8_unchecked` is exactly right, and replacing a panic with defined behavior is a clear improvement. The `Cow` return keeps the valid-UTF-8 path a zero-cost borrow, and every `get_string` call site feeds the result into `append_value` (`impl AsRef<str>`), which `Cow<str>` satisfies, so nothing else needs to change. On CI: the one failing check (`Spark 3.5 [expressions]`) is pure infrastructure, not this PR. The job could not download Maven itself (`HTTP 403 for .../apache-maven-3.9.6-bin.zip`). A re-run should clear it. One compatibility nuance worth raising. The description says the lossy decode matches how Spark renders the same bytes. That holds for single invalid bytes, which is what the tests cover, but it is not universally true. Spark renders `StringType` bytes on the driver via `UTF8String.toString()`, which is `new String(bytes, StandardCharsets.UTF_8)`, so Java's decoder is the reference, and Rust's `from_utf8_lossy` uses a different replacement granularity for some multi-byte sequences. I compared the two directly (Rust `from_utf8_lossy` vs JDK 17 `new String(bytes, UTF_8)`): | Bytes | Rust | Java (Spark) | Match | |---|---|---|---| | `FF FE 41` | `FFFD FFFD A` | `FFFD FFFD A` | yes | | `80 42` | `FFFD B` | `FFFD B` | yes | | `E0 80` | `FFFD FFFD` | `FFFD FFFD` | yes | | `F0 80 80 41` | `FFFD FFFD FFFD A` | `FFFD FFFD FFFD A` | yes | | `C0 AF` | `FFFD FFFD` | `FFFD FFFD` | yes | | `ED A0 80` (surrogate) | `FFFD FFFD FFFD` | `FFFD` | **no** | | `F4 90 80 80` | `FFFD FFFD FFFD FFFD` | `FFFD FFFD FFFD FFFD` | yes | So for surrogate-range sequences (`ED A0..BF ...`), Rust emits one `U+FFFD` per byte while Java collapses the whole ill-formed unit into a single `U+FFFD`. That pattern is reachable through the same `cast(BinaryType -> StringType)` path this issue is about, and it shows up in real data that went through CESU-8 or Java modified-UTF-8 (surrogate-pair encoded supplementary characters), which some Hadoop and Hive paths produce. The practical effect is that a Comet columnar shuffle of such a column would produce `FFFD FFFD FFFD` while Spark's own shuffle produces `FFFD`, so the answers would differ. This is not a blocker. Not crashing is clearly better than the status quo. But the "matches Spark" claim has this exception, so it would be good to either note the limitation or, if exact parity matters, decode with the same maximal-subpart strategy Java uses. On tests: the unit test is a nice low-level reproducer, and the end-to-end test is thoughtfully done, especially disabling Comet's own `Cast` so the raw bytes actually reach the JVM shuffle inside an `UnsafeRow`. Since all the tested sequences are single invalid bytes (where Rust and Java agree), would it be worth adding a case like `ED A0 80`? With the current `from_utf8_lossy` that row would decode to three `U+FFFD` in Comet but one in Spark, so it would show whether the answers still line up on the surrogate path. *Disclosure: I used Claude Code to help review this PR, including running the Rust-vs-Java decoder comparison above.* -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
