Re: [PR] [SPARK-57599][SQL] Decode Variant strings and keys as UTF-8 [spark]

via GitHub Tue, 23 Jun 2026 10:07:38 -0700


uros-b commented on code in PR #56659:
URL: https://github.com/apache/spark/pull/56659#discussion_r3461548325



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala:
##########
@@ -248,6 +248,27 @@ class VariantExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
     check(expectedResult4, smallObject, smallMetadata)
   }
 
+  test("SPARK-57599: non-ASCII keys and string values decode as UTF-8") {
+    // The variant writer stores object keys and string values as UTF-8. The 
reader must decode
+    // them as UTF-8 rather than the JVM default charset, which is 
platform-dependent on Java 17
+    // (e.g. US-ASCII under LANG=C) and would corrupt multi-byte characters. 
Round-trip JSON
+    // through parse_json/to_json to exercise both getMetadataKey (keys) and 
getString (values),
+    // including a value long enough (> 63 UTF-8 bytes) to use the LONG_STR 
encoding.
+    // scalastyle:off nonascii
+    val long = "你" * 22 // 66 UTF-8 bytes, over MAX_SHORT_STR_SIZE -> LONG_STR 
encoding
+    Seq(
+      "\"café\"", // 2-byte UTF-8 in a string value
+      "\"你好世界\"", // 3-byte UTF-8 in a string value
+      "{\"你好\":\"世界\"}", // non-ASCII object key and string value
+      "{\"é\":[\"é\",\"你好\"]}", // non-ASCII key with non-ASCII array values
+      "\"" + long + "\"", // LONG_STR string value
+      "{\"" + long + "\":\"ok\"}" // long non-ASCII object key
+    ).foreach { json =>
+      checkEvaluation(StructsToJson(Map.empty, Literal(parseJson(json))), json)
+    }
+    // scalastyle:on nonascii
+  }

Review Comment:
   Nit: the bundled test is a sanity round-trip, not a true regression guard: 
Spark's test JVM pins -Dfile.encoding=UTF-8 (project/SparkBuild.scala + Maven 
Surefire), so the old default-charset decode already produced UTF-8 under CI 
and the test passes with OR without the fix. 
   
   Can we make a better regression catch.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionSuite.scala:
##########
@@ -248,6 +248,27 @@ class VariantExpressionSuite extends SparkFunSuite with 
ExpressionEvalHelper {
     check(expectedResult4, smallObject, smallMetadata)
   }
 
+  test("SPARK-57599: non-ASCII keys and string values decode as UTF-8") {
+    // The variant writer stores object keys and string values as UTF-8. The 
reader must decode
+    // them as UTF-8 rather than the JVM default charset, which is 
platform-dependent on Java 17
+    // (e.g. US-ASCII under LANG=C) and would corrupt multi-byte characters. 
Round-trip JSON
+    // through parse_json/to_json to exercise both getMetadataKey (keys) and 
getString (values),
+    // including a value long enough (> 63 UTF-8 bytes) to use the LONG_STR 
encoding.
+    // scalastyle:off nonascii
+    val long = "你" * 22 // 66 UTF-8 bytes, over MAX_SHORT_STR_SIZE -> LONG_STR 
encoding
+    Seq(
+      "\"café\"", // 2-byte UTF-8 in a string value
+      "\"你好世界\"", // 3-byte UTF-8 in a string value
+      "{\"你好\":\"世界\"}", // non-ASCII object key and string value
+      "{\"é\":[\"é\",\"你好\"]}", // non-ASCII key with non-ASCII array values
+      "\"" + long + "\"", // LONG_STR string value
+      "{\"" + long + "\":\"ok\"}" // long non-ASCII object key
+    ).foreach { json =>
+      checkEvaluation(StructsToJson(Map.empty, Literal(parseJson(json))), json)
+    }
+    // scalastyle:on nonascii
+  }

Review Comment:
   Nit: the bundled test is a sanity round-trip, not a true regression guard: 
Spark's test JVM pins -Dfile.encoding=UTF-8 (project/SparkBuild.scala + Maven 
Surefire), so the old default-charset decode already produced UTF-8 under CI 
and the test passes with OR without the fix. 
   
   Can we make a better regression catch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57599][SQL] Decode Variant strings and keys as UTF-8 [spark]

Reply via email to