LuciferYang opened a new pull request, #56659: URL: https://github.com/apache/spark/pull/56659
### What changes were proposed in this pull request?

`VariantUtil.getString` and `VariantUtil.getMetadataKey` decoded their bytes with `new String(bytes, offset, length)`, which uses the JVM default charset. This PR decodes both with `StandardCharsets.UTF_8`, matching how `VariantBuilder` writes keys and string values (`getBytes(StandardCharsets.UTF_8)`) and the UTF-8 string data defined by the variant binary format. A round-trip test over non-ASCII object keys and string values is added.

### Why are the changes needed?

The variant binary stores dictionary keys and string content as UTF-8, but the reader decoded them with the platform default charset, so the write and read sides disagree whenever that charset is not UTF-8. On Java 17 the default charset is environment-dependent (for example `US-ASCII` when the JVM starts under `LANG=C`), and reading a variant that contains non-ASCII keys or strings then silently corrupts those characters. Existing tests don't catch this because the test JVM pins `-Dfile.encoding=UTF-8`.

### Does this PR introduce _any_ user-facing change?

No change when the JVM default charset is UTF-8, which is the common case and what CI runs. Under a non-UTF-8 default charset, operations that read variant strings or keys -- `to_json(variant)`, casting a variant to string, `schema_of_variant`, and field/key extraction -- now return the correct characters instead of corrupting non-ASCII keys and string values.

### How was this patch tested?

Added a round-trip test in `VariantExpressionSuite` that parses and re-serializes JSON containing non-ASCII object keys and string values; `build/sbt 'catalyst/testOnly *VariantExpressionSuite'` passes. The corruption under a non-UTF-8 default charset was verified separately by decoding the same UTF-8 bytes under `-Dfile.encoding=US-ASCII`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)
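
The write/read mismatch described under "Why are the changes needed?" can be illustrated with a small standalone sketch. It is not the Spark code: `VariantCharsetSketch` and `decode` are hypothetical names, and passing `StandardCharsets.US_ASCII` explicitly is only an assumption standing in for the old `new String(bytes, offset, length)` call on a JVM whose default charset is US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Spark.
public class VariantCharsetSketch {

    // Decodes a byte range with an explicit charset. Passing US_ASCII here
    // stands in for the old default-charset decode on a JVM started with a
    // US-ASCII default; passing UTF_8 mirrors the fixed read path.
    public static String decode(byte[] bytes, int offset, int length, Charset charset) {
        return new String(bytes, offset, length, charset);
    }

    public static void main(String[] args) {
        // Non-ASCII sample written with escapes so the source file encoding
        // does not affect the demo.
        String original = "cl\u00e9=\u503c";

        // The write side always encodes as UTF-8.
        byte[] stored = original.getBytes(StandardCharsets.UTF_8);

        String fixed = decode(stored, 0, stored.length, StandardCharsets.UTF_8);
        String legacy = decode(stored, 0, stored.length, StandardCharsets.US_ASCII);

        System.out.println("UTF-8 round trip intact: " + fixed.equals(original));
        System.out.println("US-ASCII round trip intact: " + legacy.equals(original));
    }
}
```

Decoding with UTF-8 recovers the original string, while the US-ASCII decode replaces every byte above `0x7F` with a replacement character, so the second comparison fails. Pure-ASCII keys and values decode identically under both charsets, which is why the problem only surfaces with non-ASCII content.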
