LuciferYang opened a new pull request, #56659:
URL: https://github.com/apache/spark/pull/56659

   ### What changes were proposed in this pull request?
   
   `VariantUtil.getString` and `VariantUtil.getMetadataKey` decoded their bytes 
with `new String(bytes, offset, length)`, which uses the JVM default charset. 
This PR decodes both with `StandardCharsets.UTF_8`, matching how 
`VariantBuilder` writes keys and string values 
(`getBytes(StandardCharsets.UTF_8)`) and the UTF-8 string data defined by the 
variant binary format. A round-trip test over non-ASCII object keys and string 
values is added.
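
For illustration, the decoding pattern described above can be sketched as follows. This is a minimal standalone sketch, not the actual `VariantUtil` code: the class `Utf8DecodeSketch` and the helper `decodeUtf8` are hypothetical names, and the buffer layout is made up for the example. It only shows a slice of a byte buffer being decoded with an explicit charset so that it matches the `getBytes(StandardCharsets.UTF_8)` call on the write side.

```java
import java.nio.charset.StandardCharsets;

class Utf8DecodeSketch {
    // Hypothetical helper (not Spark's VariantUtil): decodes a slice of a
    // buffer with an explicit charset instead of the JVM default.
    static String decodeUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café", written with a Unicode escape so the source file stays ASCII.
        String key = "caf\u00e9";
        byte[] encoded = key.getBytes(StandardCharsets.UTF_8);

        // Place the encoded key inside a larger buffer to mimic an
        // offset/length read (the layout here is illustrative only).
        byte[] buffer = new byte[encoded.length + 2];
        System.arraycopy(encoded, 0, buffer, 1, encoded.length);

        String decoded = decodeUtf8(buffer, 1, encoded.length);
        System.out.println(decoded.equals(key));
    }
}
```

Because both sides name the charset explicitly, the result no longer depends on `file.encoding` or the process locale.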
   
   ### Why are the changes needed?
   
   The variant binary stores dictionary keys and string content as UTF-8, but 
the reader decoded them with the platform default charset, so the write and 
read sides disagree whenever that charset is not UTF-8. On Java 17, which 
predates the UTF-8 default introduced in JDK 18 (JEP 400), the default charset 
is environment-dependent (for example `US-ASCII` when the JVM starts under 
`LANG=C`), so reading a variant that contains non-ASCII keys or strings 
silently corrupts those characters. Existing tests don't catch this because 
the test JVM pins `-Dfile.encoding=UTF-8`.
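
The mismatch can be reproduced without changing JVM flags by passing the reader's charset explicitly. The sketch below is illustrative only: `CharsetMismatchSketch` and `decodeAs` are hypothetical names, and `US_ASCII` stands in for whatever non-UTF-8 default the JVM might pick up from its environment. Each byte of a multi-byte UTF-8 sequence is outside the ASCII range, so it decodes to the replacement character U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchSketch {
    // Hypothetical helper: decodes bytes with the charset a reader would use.
    static String decodeAs(byte[] bytes, Charset readerCharset) {
        return new String(bytes, 0, bytes.length, readerCharset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", escaped to keep the source ASCII
        byte[] written = original.getBytes(StandardCharsets.UTF_8);

        // Reader assumes US-ASCII, as a JVM started under LANG=C might.
        String misread = decodeAs(written, StandardCharsets.US_ASCII);
        // Reader uses the same charset as the writer.
        String correct = decodeAs(written, StandardCharsets.UTF_8);

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("US-ASCII round-trips: " + misread.equals(original));
        System.out.println("UTF-8 round-trips: " + correct.equals(original));
    }
}
```

No exception is thrown on the mismatched path, which is why the corruption goes unnoticed unless the decoded text is compared with the input.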
   
   ### Does this PR introduce _any_ user-facing change?
   
   No change when the JVM default charset is UTF-8, which is the common case 
and what CI runs. Under a non-UTF-8 default charset, operations that read 
variant strings or keys -- `to_json(variant)`, casting a variant to string, 
`schema_of_variant`, and field/key extraction -- now return the correct 
characters instead of corrupting non-ASCII keys and string values.
   
   ### How was this patch tested?
   
   Added a round-trip test in `VariantExpressionSuite` that parses and 
re-serializes JSON containing non-ASCII object keys and string values; 
`build/sbt 'catalyst/testOnly *VariantExpressionSuite'` passes. The corruption 
under a non-UTF-8 default charset was verified separately by decoding the same 
UTF-8 bytes under `-Dfile.encoding=US-ASCII`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

