Re: [PR] [SPARK-57599][SQL] Decode Variant strings and keys as UTF-8 [spark]

via GitHub Tue, 23 Jun 2026 20:41:37 -0700


LuciferYang commented on PR #56659:
URL: https://github.com/apache/spark/pull/56659#issuecomment-4785662663


   Good catch, @uros-b - you're right that the round-trip test couldn't fail 
without the fix, since the test JVM pins `file.encoding=UTF-8`.
   
   I added a real regression guard, `VariantUtf8DecodeSuite` (`ac3f55a`), and 
dropped the round-trip in favor of it (`74362d2`). The new suite lives in 
`common/variant`, next to `VariantUtil`. It forks a child JVM with 
`-Dfile.encoding=ISO-8859-1` and round-trips a variant with non-ASCII object 
keys and string values there, including a value over 63 UTF-8 bytes so the 
`LONG_STR` path is covered. With the old default-charset decode the keys/values 
are corrupted (e.g. `café` -> `cafÃ©`) and the child exits non-zero; with the 
fix it passes. I verified both directions by temporarily reverting the fix. The 
suite cancels itself (`assume`) if the child still ends up with a UTF-8 default 
charset, and it adds ~0.4s of test time.
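
   For readers following along, here is a minimal, self-contained sketch of the 
failure mode described above. It is not the code in `VariantUtil` or 
`VariantUtf8DecodeSuite`; the class and method names (`Utf8DecodeSketch`, 
`decodeUtf8`, `decodeLatin1`) are hypothetical, and the explicit ISO-8859-1 
decode only stands in for what a default-charset decode would do in a JVM 
started with `-Dfile.encoding=ISO-8859-1`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
    // Explicit UTF-8 decode: the result does not depend on -Dfile.encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stand-in (assumption) for a default-charset decode when the JVM
    // default charset is ISO-8859-1: every byte becomes one character.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, written with an escape so the source
        // file encoding does not matter. UTF-8 bytes: 63 61 66 C3 A9.
        byte[] key = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // The UTF-8 decode round-trips the key.
        System.out.println(decodeUtf8(key).equals("caf\u00e9"));

        // The Latin-1 decode turns the two-byte sequence C3 A9 into the
        // two characters U+00C3 and U+00A9, i.e. the corruption quoted above.
        System.out.println(decodeLatin1(key).equals("caf\u00c3\u00a9"));

        // 40 e-acute characters encode to 80 UTF-8 bytes, so a value like
        // this is over the 63-byte limit mentioned above for short strings.
        String longValue = "\u00e9".repeat(40);
        System.out.println(longValue.getBytes(StandardCharsets.UTF_8).length > 63);
    }
}
```

   The comparisons print booleans rather than the strings themselves, so the 
output does not depend on the console encoding.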
   
   The PR is now self-contained in `common/variant`: the `VariantUtil` fix plus 
this suite, with no change to `VariantExpressionSuite`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

