This is an automated email from the ASF dual-hosted git repository.

cloud-fan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0594b127e029 [SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in 
Variant JSON parsing
0594b127e029 is described below

commit 0594b127e0294dd95fac2096575bd400d13086f9
Author: Jahnavi Nelavelli <[email protected]>
AuthorDate: Fri May 22 17:26:31 2026 +0800

    [SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON parsing
    
    Jackson's permissive surrogate handling let lone surrogates from \uXXXX 
escapes pass through `parse_json`, `try_parse_json`, and 
`from_json('variant')`, where `getBytes(UTF_8)` then silently substituted 
U+FFFD and corrupted the Variant. Validate the decoded strings before they 
enter the dictionary or write buffer, gated by a new internal SQL conf 
(default-on) for opt-out compatibility.
    
    ### What changes were proposed in this pull request?
    
    This PR adds strict Unicode validation to the Variant JSON parser so it 
rejects strings containing unpaired UTF-16 surrogate code units (e.g. a lone 
`\uD835` high surrogate). The check runs inside `VariantBuilder.buildJson` for 
both JSON object keys and string values, before either is encoded to UTF-8 and 
committed to the Variant binary buffer.
    The validation is gated by a new internal SQL conf 
`spark.sql.variant.validateUnicodeInJsonParsing`, defaulting to `true` so the 
strict, RFC 8259-compliant behavior is enabled by default. Setting the conf to 
`false` restores the legacy permissive behavior as a transitional escape hatch 
for pipelines that currently depend on it.
    
    The fix applies to all three Variant-parsing entry points:
    - `parse_json` — throws `MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION` on 
lone surrogates.
    - `try_parse_json` — returns `NULL`.
    - `from_json` — returns `NULL` in `PERMISSIVE` mode (default), throws in 
`FAILFAST`.
    
    ### Why are the changes needed?
    
    1. JSON containing a lone surrogate (e.g. `"\uD835"` not followed by a low 
surrogate) is invalid.
    2. Strict parsers such as simdjson reject these inputs; Jackson's 
`ReaderBasedJsonParser`, which Spark uses on the JVM, accepts them and decodes 
the escape into a Java `char` containing the lone surrogate.
    3.  The Variant ends up containing `?` where the original input was 
supposed to be, with no error or warning a silent data-corruption bug.
    4. The records containing `\uD835` were silently accepted with substituted 
characters when handled by the JVM, but correctly rejected by Photon.
    5.  This PR closes that JVM ↔ Photon divergence at its root.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. With the default spark.sql.variant.validateUnicodeInJsonParsing = 
true, JSON input containing an unpaired UTF-16 surrogate (e.g. a lone \uD835) 
will now produce an error instead of being silently accepted. Specifically:
    - parse_json throws MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION.
     - try_parse_json returns NULL.
     - from_json(col, 'variant') returns NULL in PERMISSIVE mode (default) and 
throws in FAILFAST.
    
    Previously, the lone surrogate was decoded into a Java char, then silently 
substituted with the Unicode replacement character during UTF-8 encoding, 
producing a Variant value containing ? with no error or warning. Setting 
spark.sql.variant.validateUnicodeInJsonParsing = false restores the previous 
permissive behavior as a transitional opt-out.
    
    ### How was this patch tested?
    
     Added new test cases in VariantExpressionEvalUtilsSuite (unit tests for 
both reject and legacy-mode paths, covering lone high/low surrogates as values 
and as object keys, plus valid surrogate pairs as a control) and 
VariantEndToEndSuite (end-to-end SQL test exercising parse_json / 
try_parse_json / from_json in both PERMISSIVE and FAILFAST modes, with the conf 
flipped on and off).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    co-authored by : 'claude-opus-4.7'
    
    Closes #55661 from NJAHNAVI2907/SPARK-56654-strict-unicode-json.
    
    Lead-authored-by: Jahnavi Nelavelli 
<[email protected]>
    Co-authored-by: NJAHNAVI2907 <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 .../apache/spark/types/variant/VariantBuilder.java | 74 ++++++++++++++++++++--
 .../variant/VariantExpressionEvalUtils.scala       |  6 +-
 .../expressions/variant/variantExpressions.scala   |  5 +-
 .../spark/sql/catalyst/json/JacksonParser.scala    |  5 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    | 13 ++++
 .../variant/VariantExpressionEvalUtilsSuite.scala  | 46 ++++++++++++++
 .../function_is_valid_variant.explain              |  2 +-
 .../function_is_variant_null.explain               |  2 +-
 .../explain-results/function_parse_json.explain    |  2 +-
 .../function_schema_of_variant.explain             |  2 +-
 .../function_schema_of_variant_agg.explain         |  2 +-
 .../function_try_parse_json.explain                |  2 +-
 .../function_try_variant_get.explain               |  2 +-
 .../explain-results/function_variant_get.explain   |  2 +-
 .../apache/spark/sql/VariantEndToEndSuite.scala    | 33 ++++++++++
 15 files changed, 180 insertions(+), 18 deletions(-)

diff --git 
a/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
 
b/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
index 1bd008a5c914..aaf6f72bd536 100644
--- 
a/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
+++ 
b/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
@@ -43,7 +43,12 @@ import static org.apache.spark.types.variant.VariantUtil.*;
  */
 public class VariantBuilder {
   public VariantBuilder(boolean allowDuplicateKeys) {
+    this(allowDuplicateKeys, true);
+  }
+
+  public VariantBuilder(boolean allowDuplicateKeys, boolean 
validateUnicodeInJsonParsing) {
     this.allowDuplicateKeys = allowDuplicateKeys;
+    this.validateUnicodeInJsonParsing = validateUnicodeInJsonParsing;
   }
 
   /**
@@ -53,18 +58,41 @@ public class VariantBuilder {
    * @throws IOException if any JSON parsing error happens.
    */
   public static Variant parseJson(String json, boolean allowDuplicateKeys) 
throws IOException {
+    return parseJson(json, allowDuplicateKeys, true);
+  }
+
+  /**
+   * Similar to {@link #parseJson(String, boolean)}, but additionally controls 
whether JSON
+   * string contents are validated to be well-formed Unicode (no unpaired 
UTF-16 surrogate code
+   * units). Strict validation is the default and matches RFC 8259 section 7. 
The flag exists
+   * to allow callers to opt out for backward compatibility with input that 
previously parsed
+   * (with the unpaired surrogate silently replaced by the Unicode replacement 
character).
+   */
+  public static Variant parseJson(String json, boolean allowDuplicateKeys,
+      boolean validateUnicodeInJsonParsing) throws IOException {
     try (JsonParser parser = new JsonFactory().createParser(json)) {
       parser.nextToken();
-      return parseJson(parser, allowDuplicateKeys);
+      return parseJson(parser, allowDuplicateKeys, 
validateUnicodeInJsonParsing);
     }
   }
 
   /**
-   * Similar {@link #parseJson(String, boolean)}, but takes a JSON parser 
instead of string input.
+   * Similar to {@link #parseJson(String, boolean)}, but takes a JSON parser 
instead of string
+   * input.
    */
   public static Variant parseJson(JsonParser parser, boolean 
allowDuplicateKeys)
       throws IOException {
-    VariantBuilder builder = new VariantBuilder(allowDuplicateKeys);
+    return parseJson(parser, allowDuplicateKeys, true);
+  }
+
+  /**
+   * Similar to {@link #parseJson(JsonParser, boolean)}, but additionally 
controls whether JSON
+   * string contents are validated to be well-formed Unicode. See
+   * {@link #parseJson(String, boolean, boolean)}.
+   */
+  public static Variant parseJson(JsonParser parser, boolean 
allowDuplicateKeys,
+      boolean validateUnicodeInJsonParsing) throws IOException {
+    VariantBuilder builder = new VariantBuilder(allowDuplicateKeys, 
validateUnicodeInJsonParsing);
     builder.buildJson(parser);
     return builder.result();
   }
@@ -495,6 +523,9 @@ public class VariantBuilder {
         int start = writePos;
         while (parser.nextToken() != JsonToken.END_OBJECT) {
           String key = parser.currentName();
+          if (validateUnicodeInJsonParsing) {
+            checkValidUnicodeString(key, parser);
+          }
           parser.nextToken();
           int id = addKey(key);
           fields.add(new FieldEntry(key, id, writePos - start));
@@ -513,9 +544,14 @@ public class VariantBuilder {
         finishWritingArray(start, offsets);
         break;
       }
-      case VALUE_STRING:
-        appendString(parser.getText());
+      case VALUE_STRING: {
+        String text = parser.getText();
+        if (validateUnicodeInJsonParsing) {
+          checkValidUnicodeString(text, parser);
+        }
+        appendString(text);
         break;
+      }
       case VALUE_NUMBER_INT:
         try {
           appendLong(parser.getLongValue());
@@ -557,6 +593,30 @@ public class VariantBuilder {
     }
   }
 
+  // Reject JSON strings that contain unpaired UTF-16 surrogate code units. 
Java strings can
+  // hold lone surrogates, but RFC 8259 section 7 requires JSON string 
contents to be well-formed
+  // Unicode. Stricter parsers such as simdjson reject these inputs, while 
Jackson's
+  // `ReaderBasedJsonParser` accepts them and silently replaces the invalid 
character with U+FFFD
+  // when the result is encoded as UTF-8. That silent replacement causes data 
corruption, so
+  // we surface a JSON parse error instead.
+  private static void checkValidUnicodeString(String str, JsonParser parser)
+      throws JsonParseException {
+    int len = str.length();
+    for (int i = 0; i < len; ++i) {
+      char c = str.charAt(i);
+      if (Character.isHighSurrogate(c)) {
+        if (i + 1 >= len || !Character.isLowSurrogate(str.charAt(i + 1))) {
+          throw new JsonParseException(parser, String.format(
+              "Invalid Unicode in JSON string: lone high surrogate U+%04X", 
(int) c));
+        }
+        ++i;
+      } else if (Character.isLowSurrogate(c)) {
+        throw new JsonParseException(parser, String.format(
+            "Invalid Unicode in JSON string: lone low surrogate U+%04X", (int) 
c));
+      }
+    }
+  }
+
   // Try to parse a JSON number as a decimal. Return whether the parsing 
succeeds. The input must
   // only use the decimal format (an integer value with an optional '.' in it) 
and must not use
   // scientific notation. It also must fit into the precision limitation of 
decimal types.
@@ -583,4 +643,8 @@ public class VariantBuilder {
   // Store all keys in `dictionary` in the order of id.
   private final ArrayList<byte[]> dictionaryKeys = new ArrayList<>();
   private final boolean allowDuplicateKeys;
+  // When true, JSON string contents are validated to be well-formed Unicode 
(RFC 8259 sec 7).
+  // Unpaired UTF-16 surrogate code units cause a `JsonParseException` to be 
thrown during
+  // `buildJson`, which surfaces as a `MALFORMED_RECORD_IN_PARSING` error to 
SQL callers.
+  private final boolean validateUnicodeInJsonParsing;
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
index 1f23ba05bc03..b32064ce78c1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
@@ -34,7 +34,8 @@ object VariantExpressionEvalUtils {
   def parseJson(
       input: UTF8String,
       allowDuplicateKeys: Boolean = false,
-      failOnError: Boolean = true): VariantVal = {
+      failOnError: Boolean = true,
+      validateUnicodeInJsonParsing: Boolean = true): VariantVal = {
     def parseJsonFailure(exception: Throwable): VariantVal = {
       if (failOnError) {
         throw exception
@@ -43,7 +44,8 @@ object VariantExpressionEvalUtils {
       }
     }
     try {
-      val v = VariantBuilder.parseJson(input.toString, allowDuplicateKeys)
+      val v = VariantBuilder.parseJson(
+        input.toString, allowDuplicateKeys, validateUnicodeInJsonParsing)
       new VariantVal(v.getValue, v.getMetadata)
     } catch {
       case _: VariantSizeLimitException =>
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
index a928c59fdfc5..5d78f11bf86f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
@@ -62,8 +62,9 @@ case class ParseJson(child: Expression, failOnError: Boolean 
= true)
     Seq(
       child,
       Literal(SQLConf.get.getConf(SQLConf.VARIANT_ALLOW_DUPLICATE_KEYS), 
BooleanType),
-      Literal(failOnError, BooleanType)),
-    inputTypes :+ BooleanType :+ BooleanType,
+      Literal(failOnError, BooleanType),
+      
Literal(SQLConf.get.getConf(SQLConf.VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING), 
BooleanType)),
+    inputTypes :+ BooleanType :+ BooleanType :+ BooleanType,
     returnNullable = !failOnError)
 
   override def inputTypes: Seq[AbstractDataType] =
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 797355bed831..fa03a9aea833 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -122,6 +122,8 @@ class JacksonParser(
   }
 
   private val variantAllowDuplicateKeys = 
SQLConf.get.getConf(SQLConf.VARIANT_ALLOW_DUPLICATE_KEYS)
+  private val variantValidateUnicodeInJsonParsing =
+    SQLConf.get.getConf(SQLConf.VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING)
 
   protected final def parseVariant(parser: JsonParser): VariantVal = {
     // Skips `FIELD_NAME` at the beginning. This check is adapted from 
`parseJsonToken`, but we
@@ -131,7 +133,8 @@ class JacksonParser(
       parser.nextToken()
     }
     try {
-      val v = VariantBuilder.parseJson(parser, variantAllowDuplicateKeys)
+      val v = VariantBuilder.parseJson(
+        parser, variantAllowDuplicateKeys, variantValidateUnicodeInJsonParsing)
       new VariantVal(v.getValue, v.getMetadata)
     } catch {
       case _: VariantSizeLimitException =>
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 77ef8bb600f9..b3de3f0813a7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -6129,6 +6129,19 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING =
+    buildConf("spark.sql.variant.validateUnicodeInJsonParsing")
+      .internal()
+      .doc("When true, parsing variant from JSON rejects strings that contain 
unpaired UTF-16 " +
+        "surrogate code units (such as a lone high surrogate like \\uD835), 
which are invalid " +
+        "per RFC 8259 section 7. When false, restores the legacy permissive 
behavior in which " +
+        "the unpaired surrogate is silently replaced by the Unicode 
replacement character " +
+        "during UTF-8 encoding, causing data corruption that diverges from 
strict JSON parsers.")
+      .version("4.3.0")
+      .withBindingPolicy(ConfigBindingPolicy.SESSION)
+      .booleanConf
+      .createWithDefault(true)
+
   val VARIANT_ALLOW_READING_SHREDDED =
     buildConf("spark.sql.variant.allowReadingShredded")
       .internal()
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
index 2aef7c455e64..605c542ba7f4 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
@@ -140,6 +140,52 @@ class VariantExpressionEvalUtilsSuite extends 
SparkFunSuite {
     }
   }
 
+  test("SPARK-56654: reject unpaired UTF-16 surrogates in JSON strings") {
+    val invalidJsonInputs = Seq(
+      "\"\\uD835\"",                  // lone high surrogate (string value)
+      "\"\\uDC00\"",                  // lone low surrogate (string value)
+      "\"\\uD835x\\uDC00\"",          // surrogates separated by non-surrogate
+      "\"\\uD835\\uD835\"",           // two high surrogates in a row
+      "\"prefix \\uD835\"",           // trailing lone high surrogate
+      "{\"\\uD835\":1}",              // lone surrogate in an object key
+      "[\"ok\", \"\\uDC00\"]"         // lone surrogate inside an array element
+    )
+    for (json <- invalidJsonInputs) {
+      checkError(
+        exception = intercept[SparkThrowable] {
+          VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+            allowDuplicateKeys = false)
+        },
+        condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+        parameters = Map("badRecord" -> json, "failFastMode" -> "FAILFAST")
+      )
+      val tryResult = 
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+        allowDuplicateKeys = false, failOnError = false)
+      assert(tryResult === null)
+    }
+    val validJsonInputs = Seq(
+      "\"\\uD83D\\uDE05\"",           // U+1F605 GRINNING FACE WITH SWEAT
+      "\"\\uD835\\uDC00\"",           // U+1D400 MATHEMATICAL BOLD CAPITAL A
+      "{\"\\uD83D\\uDE05\":1}",       // surrogate pair in an object key
+      "[\"\\uD835\\uDC00\"]"          // surrogate pair inside an array
+    )
+    for (json <- validJsonInputs) {
+      val parsed = 
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+        allowDuplicateKeys = false)
+      assert(parsed != null, s"expected non-null variant for $json")
+    }
+  }
+
+  test("SPARK-56654: legacy mode accepts unpaired surrogates") {
+    val json = "\"\\uD835\""
+    val parsed = 
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+      allowDuplicateKeys = false, validateUnicodeInJsonParsing = false)
+    assert(parsed != null)
+    val tryParsed = 
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+      allowDuplicateKeys = false, failOnError = false, 
validateUnicodeInJsonParsing = false)
+    assert(tryParsed != null)
+  }
+
   test("isVariantNull") {
     def check(json: String, expected: Boolean): Unit = {
       if (json != null) {
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
index 324e72dfbcc1..93c3d3c0547e 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
@@ -1,2 +1,2 @@
-Project 
[static_invoke(VariantExpressionEvalUtils.isValidVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
 false, true)))) AS is_valid_variant(parse_json(g))#0]
+Project 
[static_invoke(VariantExpressionEvalUtils.isValidVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
 false, true, true)))) AS is_valid_variant(parse_json(g))#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
index 988447c2d641..62eeba635510 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
@@ -1,2 +1,2 @@
-Project 
[static_invoke(VariantExpressionEvalUtils.isVariantNull(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
 false, true)))) AS is_variant_null(parse_json(g))#0]
+Project 
[static_invoke(VariantExpressionEvalUtils.isVariantNull(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
 false, true, true)))) AS is_variant_null(parse_json(g))#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
index a40f89c03c88..ddaea2f318b1 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
@@ -1,2 +1,2 @@
-Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, true)) 
AS parse_json(g)#0]
+Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, true, 
true)) AS parse_json(g)#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
index c82a10655c33..70831c24afe8 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
@@ -1,2 +1,2 @@
-Project 
[static_invoke(SchemaOfVariant.schemaOfVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
 false, true)))) AS schema_of_variant(parse_json(g))#0]
+Project 
[static_invoke(SchemaOfVariant.schemaOfVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
 false, true, true)))) AS schema_of_variant(parse_json(g))#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
index 3d894628ab7e..de06881d96d4 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
@@ -1,2 +1,2 @@
-Aggregate 
[schema_of_variant_agg(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, 
false, true)), 0, 0) AS schema_of_variant_agg(parse_json(g))#0]
+Aggregate 
[schema_of_variant_agg(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, 
false, true, true)), 0, 0) AS schema_of_variant_agg(parse_json(g))#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
index 7a0c0078128f..efbfc467ba4b 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
@@ -1,2 +1,2 @@
-Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, 
false)) AS try_parse_json(g)#0]
+Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, false, 
true)) AS try_parse_json(g)#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
index 65f527da78c4..68eace9d8322 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
@@ -1,2 +1,2 @@
-Project 
[try_variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, 
true)), $, IntegerType, false, Some(America/Los_Angeles)) AS 
try_variant_get(parse_json(g), $)#0]
+Project 
[try_variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, 
true, true)), $, IntegerType, false, Some(America/Los_Angeles)) AS 
try_variant_get(parse_json(g), $)#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
index 33c6b3a52f52..c13d45774492 100644
--- 
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
+++ 
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
@@ -1,2 +1,2 @@
-Project [variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, 
false, true)), $, IntegerType, true, Some(America/Los_Angeles)) AS 
variant_get(parse_json(g), $)#0]
+Project [variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, 
false, true, true)), $, IntegerType, true, Some(America/Los_Angeles)) AS 
variant_get(parse_json(g), $)#0]
 +- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
index 127c9218d4b7..2d26356890d2 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
@@ -185,6 +185,39 @@ class VariantEndToEndSuite extends SharedSparkSession {
     checkAnswer(variantDF, Seq(Row(expected)))
   }
 
+  test("SPARK-56654: parse_json/from_json reject unpaired UTF-16 surrogates by 
default") {
+    val invalidJson = "\"\\uD835\""
+    val df = Seq(invalidJson).toDF("j")
+    checkAnswer(df.selectExpr("try_parse_json(j)"), Seq(Row(null)))
+    checkAnswer(df.selectExpr("from_json(j, 'variant')"), Seq(Row(null)))
+    val parseJsonError = intercept[SparkException] {
+      df.selectExpr("parse_json(j)").collect()
+    }
+    checkError(
+      exception = parseJsonError,
+      condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+      parameters = Map("badRecord" -> invalidJson, "failFastMode" -> 
"FAILFAST")
+    )
+
+    val fromJsonFailFast = intercept[SparkException] {
+      df.selectExpr("from_json(j, 'variant', map('mode', 
'FAILFAST'))").collect()
+    }
+    checkError(
+      exception = fromJsonFailFast,
+      condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+      parameters = Map("badRecord" -> "[null]", "failFastMode" -> "FAILFAST")
+    )
+
+    withSQLConf(SQLConf.VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING.key -> 
"false") {
+      val parsed = df.selectExpr("parse_json(j)").collect()
+      assert(parsed.length == 1 && parsed.head.get(0) != null,
+        "legacy mode should accept unpaired surrogates")
+      val tryParsed = df.selectExpr("try_parse_json(j)").collect()
+      assert(tryParsed.length == 1 && tryParsed.head.get(0) != null,
+        "legacy mode should accept unpaired surrogates via try_parse_json")
+    }
+  }
+
   test("to_variant_object - Codegen Support") {
     Seq("CODEGEN_ONLY", "NO_CODEGEN").foreach { codegenMode =>
       withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> codegenMode) {


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to