LuciferYang commented on PR #56659: URL: https://github.com/apache/spark/pull/56659#issuecomment-4785662663
Good catch, @uros-b - you're right that the round-trip couldn't fail without the fix, since the test JVM pins `file.encoding=UTF-8`. I added a real regression guard, `VariantUtf8DecodeSuite` (`ac3f55a`), and dropped the round-trip in favor of it (`74362d2`).

The new suite lives in `common/variant`, next to `VariantUtil`. It forks a child JVM with `-Dfile.encoding=ISO-8859-1` and round-trips a variant with non-ASCII object keys and string values there, including a value over 63 UTF-8 bytes so the `LONG_STR` path is covered. With the old default-charset decode the keys/values are corrupted (e.g. `café` -> `cafÃ©`) and the child exits non-zero; with the fix it passes. I verified both directions by temporarily reverting the fix. The suite cancels itself (via `assume`) if the child ends up with a UTF-8 default charset, and it adds ~0.4s.

The PR is now self-contained in `common/variant`: the `VariantUtil` fix plus this suite, with no change to `VariantExpressionSuite`.
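For readers following along, here is a minimal sketch of the failure mode described above. It is not the code in `VariantUtil` or `VariantUtf8DecodeSuite`; the class and method names (`CharsetDecodeSketch`, `decodeWith`, `decodeUtf8`) are hypothetical, and the platform default charset is passed in explicitly as a stand-in so the behaviour does not depend on how the JVM was launched.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {
    // Hypothetical stand-in for a default-charset decode: the caller supplies
    // the charset that the JVM would have picked up from file.encoding.
    static String decodeWith(byte[] bytes, Charset assumedDefault) {
        return new String(bytes, assumedDefault);
    }

    // Decode with an explicit UTF-8 charset, independent of file.encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café"; the escape keeps the source file ASCII-only.
        byte[] stored = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Under ISO-8859-1 the two bytes of 'é' (0xC3 0xA9) become two chars.
        String corrupted = decodeWith(stored, StandardCharsets.ISO_8859_1);
        String intact = decodeUtf8(stored);

        System.out.println("corrupted equals original: " + corrupted.equals("caf\u00e9"));
        System.out.println("intact equals original:    " + intact.equals("caf\u00e9"));
    }
}
```

Passing the charset as a parameter is only a convenience for the sketch; the actual suite exercises the real default by forking a child JVM with `-Dfile.encoding=ISO-8859-1`.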
