This is an automated email from the ASF dual-hosted git repository.
cloud-fan pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.x by this push:
new d4a0c9c34bb3 [SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in
Variant JSON parsing
d4a0c9c34bb3 is described below
commit d4a0c9c34bb38ffdd4c69c4b78d9d17219904eac
Author: Jahnavi Nelavelli <[email protected]>
AuthorDate: Fri May 22 17:26:31 2026 +0800
[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON parsing
Jackson's permissive surrogate handling let lone surrogates from \uXXXX
escapes pass through `parse_json`, `try_parse_json`, and
`from_json('variant')`, where `getBytes(UTF_8)` then silently substituted
U+FFFD and corrupted the Variant. Validate the decoded strings before they
enter the dictionary or write buffer, gated by a new internal SQL conf
(default-on) for opt-out compatibility.
### What changes were proposed in this pull request?
This PR adds strict Unicode validation to the Variant JSON parser so it
rejects strings containing unpaired UTF-16 surrogate code units (e.g. a lone
`\uD835` high surrogate). The check runs inside `VariantBuilder.buildJson` for
both JSON object keys and string values, before either is encoded to UTF-8 and
committed to the Variant binary buffer.
The validation is gated by a new internal SQL conf
`spark.sql.variant.validateUnicodeInJsonParsing`, defaulting to `true` so the
strict, RFC 8259-compliant behavior is enabled by default. Setting the conf to
`false` restores the legacy permissive behavior as a transitional escape hatch
for pipelines that currently depend on it.
The fix applies to all three Variant-parsing entry points:
- `parse_json` — throws `MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION` on
lone surrogates.
- `try_parse_json` — returns `NULL`.
- `from_json` — returns `NULL` in `PERMISSIVE` mode (default), throws in
`FAILFAST`.
### Why are the changes needed?
1. JSON containing a lone surrogate (e.g. `"\uD835"` not followed by a low
surrogate) is invalid.
2. Strict parsers such as simdjson reject these inputs; Jackson's
`ReaderBasedJsonParser`, which Spark uses on the JVM, accepts them and decodes
the escape into a Java `char` containing the lone surrogate.
3. The Variant ends up containing `?` where the original input was
supposed to be, with no error or warning a silent data-corruption bug.
4. The records containing `\uD835` were silently accepted with substituted
characters when handled by the JVM, but correctly rejected by Photon.
5. This PR closes that JVM ↔ Photon divergence at its root.
### Does this PR introduce _any_ user-facing change?
Yes. With the default spark.sql.variant.validateUnicodeInJsonParsing =
true, JSON input containing an unpaired UTF-16 surrogate (e.g. a lone \uD835)
will now produce an error instead of being silently accepted. Specifically:
- parse_json throws MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION.
- try_parse_json returns NULL.
- from_json(col, 'variant') returns NULL in PERMISSIVE mode (default) and
throws in FAILFAST.
Previously, the lone surrogate was decoded into a Java char, then silently
substituted with the Unicode replacement character during UTF-8 encoding,
producing a Variant value containing ? with no error or warning. Setting
spark.sql.variant.validateUnicodeInJsonParsing = false restores the previous
permissive behavior as a transitional opt-out.
### How was this patch tested?
Added new test cases in VariantExpressionEvalUtilsSuite (unit tests for
both reject and legacy-mode paths, covering lone high/low surrogates as values
and as object keys, plus valid surrogate pairs as a control) and
VariantEndToEndSuite (end-to-end SQL test exercising parse_json /
try_parse_json / from_json in both PERMISSIVE and FAILFAST modes, with the conf
flipped on and off).
### Was this patch authored or co-authored using generative AI tooling?
co-authored by : 'claude-opus-4.7'
Closes #55661 from NJAHNAVI2907/SPARK-56654-strict-unicode-json.
Lead-authored-by: Jahnavi Nelavelli
<[email protected]>
Co-authored-by: NJAHNAVI2907 <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0594b127e0294dd95fac2096575bd400d13086f9)
Signed-off-by: Wenchen Fan <[email protected]>
---
.../apache/spark/types/variant/VariantBuilder.java | 74 ++++++++++++++++++++--
.../variant/VariantExpressionEvalUtils.scala | 6 +-
.../expressions/variant/variantExpressions.scala | 5 +-
.../spark/sql/catalyst/json/JacksonParser.scala | 5 +-
.../org/apache/spark/sql/internal/SQLConf.scala | 13 ++++
.../variant/VariantExpressionEvalUtilsSuite.scala | 46 ++++++++++++++
.../function_is_valid_variant.explain | 2 +-
.../function_is_variant_null.explain | 2 +-
.../explain-results/function_parse_json.explain | 2 +-
.../function_schema_of_variant.explain | 2 +-
.../function_schema_of_variant_agg.explain | 2 +-
.../function_try_parse_json.explain | 2 +-
.../function_try_variant_get.explain | 2 +-
.../explain-results/function_variant_get.explain | 2 +-
.../apache/spark/sql/VariantEndToEndSuite.scala | 33 ++++++++++
15 files changed, 180 insertions(+), 18 deletions(-)
diff --git
a/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
b/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
index 1bd008a5c914..aaf6f72bd536 100644
---
a/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
+++
b/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java
@@ -43,7 +43,12 @@ import static org.apache.spark.types.variant.VariantUtil.*;
*/
public class VariantBuilder {
public VariantBuilder(boolean allowDuplicateKeys) {
+ this(allowDuplicateKeys, true);
+ }
+
+ public VariantBuilder(boolean allowDuplicateKeys, boolean
validateUnicodeInJsonParsing) {
this.allowDuplicateKeys = allowDuplicateKeys;
+ this.validateUnicodeInJsonParsing = validateUnicodeInJsonParsing;
}
/**
@@ -53,18 +58,41 @@ public class VariantBuilder {
* @throws IOException if any JSON parsing error happens.
*/
public static Variant parseJson(String json, boolean allowDuplicateKeys)
throws IOException {
+ return parseJson(json, allowDuplicateKeys, true);
+ }
+
+ /**
+ * Similar to {@link #parseJson(String, boolean)}, but additionally controls
whether JSON
+ * string contents are validated to be well-formed Unicode (no unpaired
UTF-16 surrogate code
+ * units). Strict validation is the default and matches RFC 8259 section 7.
The flag exists
+ * to allow callers to opt out for backward compatibility with input that
previously parsed
+ * (with the unpaired surrogate silently replaced by the Unicode replacement
character).
+ */
+ public static Variant parseJson(String json, boolean allowDuplicateKeys,
+ boolean validateUnicodeInJsonParsing) throws IOException {
try (JsonParser parser = new JsonFactory().createParser(json)) {
parser.nextToken();
- return parseJson(parser, allowDuplicateKeys);
+ return parseJson(parser, allowDuplicateKeys,
validateUnicodeInJsonParsing);
}
}
/**
- * Similar {@link #parseJson(String, boolean)}, but takes a JSON parser
instead of string input.
+ * Similar to {@link #parseJson(String, boolean)}, but takes a JSON parser
instead of string
+ * input.
*/
public static Variant parseJson(JsonParser parser, boolean
allowDuplicateKeys)
throws IOException {
- VariantBuilder builder = new VariantBuilder(allowDuplicateKeys);
+ return parseJson(parser, allowDuplicateKeys, true);
+ }
+
+ /**
+ * Similar to {@link #parseJson(JsonParser, boolean)}, but additionally
controls whether JSON
+ * string contents are validated to be well-formed Unicode. See
+ * {@link #parseJson(String, boolean, boolean)}.
+ */
+ public static Variant parseJson(JsonParser parser, boolean
allowDuplicateKeys,
+ boolean validateUnicodeInJsonParsing) throws IOException {
+ VariantBuilder builder = new VariantBuilder(allowDuplicateKeys,
validateUnicodeInJsonParsing);
builder.buildJson(parser);
return builder.result();
}
@@ -495,6 +523,9 @@ public class VariantBuilder {
int start = writePos;
while (parser.nextToken() != JsonToken.END_OBJECT) {
String key = parser.currentName();
+ if (validateUnicodeInJsonParsing) {
+ checkValidUnicodeString(key, parser);
+ }
parser.nextToken();
int id = addKey(key);
fields.add(new FieldEntry(key, id, writePos - start));
@@ -513,9 +544,14 @@ public class VariantBuilder {
finishWritingArray(start, offsets);
break;
}
- case VALUE_STRING:
- appendString(parser.getText());
+ case VALUE_STRING: {
+ String text = parser.getText();
+ if (validateUnicodeInJsonParsing) {
+ checkValidUnicodeString(text, parser);
+ }
+ appendString(text);
break;
+ }
case VALUE_NUMBER_INT:
try {
appendLong(parser.getLongValue());
@@ -557,6 +593,30 @@ public class VariantBuilder {
}
}
+ // Reject JSON strings that contain unpaired UTF-16 surrogate code units.
Java strings can
+ // hold lone surrogates, but RFC 8259 section 7 requires JSON string
contents to be well-formed
+ // Unicode. Stricter parsers such as simdjson reject these inputs, while
Jackson's
+ // `ReaderBasedJsonParser` accepts them and silently replaces the invalid
character with U+FFFD
+ // when the result is encoded as UTF-8. That silent replacement causes data
corruption, so
+ // we surface a JSON parse error instead.
+ private static void checkValidUnicodeString(String str, JsonParser parser)
+ throws JsonParseException {
+ int len = str.length();
+ for (int i = 0; i < len; ++i) {
+ char c = str.charAt(i);
+ if (Character.isHighSurrogate(c)) {
+ if (i + 1 >= len || !Character.isLowSurrogate(str.charAt(i + 1))) {
+ throw new JsonParseException(parser, String.format(
+ "Invalid Unicode in JSON string: lone high surrogate U+%04X",
(int) c));
+ }
+ ++i;
+ } else if (Character.isLowSurrogate(c)) {
+ throw new JsonParseException(parser, String.format(
+ "Invalid Unicode in JSON string: lone low surrogate U+%04X", (int)
c));
+ }
+ }
+ }
+
// Try to parse a JSON number as a decimal. Return whether the parsing
succeeds. The input must
// only use the decimal format (an integer value with an optional '.' in it)
and must not use
// scientific notation. It also must fit into the precision limitation of
decimal types.
@@ -583,4 +643,8 @@ public class VariantBuilder {
// Store all keys in `dictionary` in the order of id.
private final ArrayList<byte[]> dictionaryKeys = new ArrayList<>();
private final boolean allowDuplicateKeys;
+ // When true, JSON string contents are validated to be well-formed Unicode
(RFC 8259 sec 7).
+ // Unpaired UTF-16 surrogate code units cause a `JsonParseException` to be
thrown during
+ // `buildJson`, which surfaces as a `MALFORMED_RECORD_IN_PARSING` error to
SQL callers.
+ private final boolean validateUnicodeInJsonParsing;
}
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
index 1f23ba05bc03..b32064ce78c1 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
@@ -34,7 +34,8 @@ object VariantExpressionEvalUtils {
def parseJson(
input: UTF8String,
allowDuplicateKeys: Boolean = false,
- failOnError: Boolean = true): VariantVal = {
+ failOnError: Boolean = true,
+ validateUnicodeInJsonParsing: Boolean = true): VariantVal = {
def parseJsonFailure(exception: Throwable): VariantVal = {
if (failOnError) {
throw exception
@@ -43,7 +44,8 @@ object VariantExpressionEvalUtils {
}
}
try {
- val v = VariantBuilder.parseJson(input.toString, allowDuplicateKeys)
+ val v = VariantBuilder.parseJson(
+ input.toString, allowDuplicateKeys, validateUnicodeInJsonParsing)
new VariantVal(v.getValue, v.getMetadata)
} catch {
case _: VariantSizeLimitException =>
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
index a928c59fdfc5..5d78f11bf86f 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala
@@ -62,8 +62,9 @@ case class ParseJson(child: Expression, failOnError: Boolean
= true)
Seq(
child,
Literal(SQLConf.get.getConf(SQLConf.VARIANT_ALLOW_DUPLICATE_KEYS),
BooleanType),
- Literal(failOnError, BooleanType)),
- inputTypes :+ BooleanType :+ BooleanType,
+ Literal(failOnError, BooleanType),
+
Literal(SQLConf.get.getConf(SQLConf.VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING),
BooleanType)),
+ inputTypes :+ BooleanType :+ BooleanType :+ BooleanType,
returnNullable = !failOnError)
override def inputTypes: Seq[AbstractDataType] =
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 797355bed831..fa03a9aea833 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -122,6 +122,8 @@ class JacksonParser(
}
private val variantAllowDuplicateKeys =
SQLConf.get.getConf(SQLConf.VARIANT_ALLOW_DUPLICATE_KEYS)
+ private val variantValidateUnicodeInJsonParsing =
+ SQLConf.get.getConf(SQLConf.VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING)
protected final def parseVariant(parser: JsonParser): VariantVal = {
// Skips `FIELD_NAME` at the beginning. This check is adapted from
`parseJsonToken`, but we
@@ -131,7 +133,8 @@ class JacksonParser(
parser.nextToken()
}
try {
- val v = VariantBuilder.parseJson(parser, variantAllowDuplicateKeys)
+ val v = VariantBuilder.parseJson(
+ parser, variantAllowDuplicateKeys, variantValidateUnicodeInJsonParsing)
new VariantVal(v.getValue, v.getMetadata)
} catch {
case _: VariantSizeLimitException =>
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 77ef8bb600f9..b3de3f0813a7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -6129,6 +6129,19 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING =
+ buildConf("spark.sql.variant.validateUnicodeInJsonParsing")
+ .internal()
+ .doc("When true, parsing variant from JSON rejects strings that contain
unpaired UTF-16 " +
+ "surrogate code units (such as a lone high surrogate like \\uD835),
which are invalid " +
+ "per RFC 8259 section 7. When false, restores the legacy permissive
behavior in which " +
+ "the unpaired surrogate is silently replaced by the Unicode
replacement character " +
+ "during UTF-8 encoding, causing data corruption that diverges from
strict JSON parsers.")
+ .version("4.3.0")
+ .withBindingPolicy(ConfigBindingPolicy.SESSION)
+ .booleanConf
+ .createWithDefault(true)
+
val VARIANT_ALLOW_READING_SHREDDED =
buildConf("spark.sql.variant.allowReadingShredded")
.internal()
diff --git
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
index 2aef7c455e64..605c542ba7f4 100644
---
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
+++
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtilsSuite.scala
@@ -140,6 +140,52 @@ class VariantExpressionEvalUtilsSuite extends
SparkFunSuite {
}
}
+ test("SPARK-56654: reject unpaired UTF-16 surrogates in JSON strings") {
+ val invalidJsonInputs = Seq(
+ "\"\\uD835\"", // lone high surrogate (string value)
+ "\"\\uDC00\"", // lone low surrogate (string value)
+ "\"\\uD835x\\uDC00\"", // surrogates separated by non-surrogate
+ "\"\\uD835\\uD835\"", // two high surrogates in a row
+ "\"prefix \\uD835\"", // trailing lone high surrogate
+ "{\"\\uD835\":1}", // lone surrogate in an object key
+ "[\"ok\", \"\\uDC00\"]" // lone surrogate inside an array element
+ )
+ for (json <- invalidJsonInputs) {
+ checkError(
+ exception = intercept[SparkThrowable] {
+ VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+ allowDuplicateKeys = false)
+ },
+ condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+ parameters = Map("badRecord" -> json, "failFastMode" -> "FAILFAST")
+ )
+ val tryResult =
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+ allowDuplicateKeys = false, failOnError = false)
+ assert(tryResult === null)
+ }
+ val validJsonInputs = Seq(
+ "\"\\uD83D\\uDE05\"", // U+1F605 GRINNING FACE WITH SWEAT
+ "\"\\uD835\\uDC00\"", // U+1D400 MATHEMATICAL BOLD CAPITAL A
+ "{\"\\uD83D\\uDE05\":1}", // surrogate pair in an object key
+ "[\"\\uD835\\uDC00\"]" // surrogate pair inside an array
+ )
+ for (json <- validJsonInputs) {
+ val parsed =
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+ allowDuplicateKeys = false)
+ assert(parsed != null, s"expected non-null variant for $json")
+ }
+ }
+
+ test("SPARK-56654: legacy mode accepts unpaired surrogates") {
+ val json = "\"\\uD835\""
+ val parsed =
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+ allowDuplicateKeys = false, validateUnicodeInJsonParsing = false)
+ assert(parsed != null)
+ val tryParsed =
VariantExpressionEvalUtils.parseJson(UTF8String.fromString(json),
+ allowDuplicateKeys = false, failOnError = false,
validateUnicodeInJsonParsing = false)
+ assert(tryParsed != null)
+ }
+
test("isVariantNull") {
def check(json: String, expected: Boolean): Unit = {
if (json != null) {
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
index 324e72dfbcc1..93c3d3c0547e 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_valid_variant.explain
@@ -1,2 +1,2 @@
-Project
[static_invoke(VariantExpressionEvalUtils.isValidVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true)))) AS is_valid_variant(parse_json(g))#0]
+Project
[static_invoke(VariantExpressionEvalUtils.isValidVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true, true)))) AS is_valid_variant(parse_json(g))#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
index 988447c2d641..62eeba635510 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
@@ -1,2 +1,2 @@
-Project
[static_invoke(VariantExpressionEvalUtils.isVariantNull(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true)))) AS is_variant_null(parse_json(g))#0]
+Project
[static_invoke(VariantExpressionEvalUtils.isVariantNull(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true, true)))) AS is_variant_null(parse_json(g))#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
index a40f89c03c88..ddaea2f318b1 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_parse_json.explain
@@ -1,2 +1,2 @@
-Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, true))
AS parse_json(g)#0]
+Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, true,
true)) AS parse_json(g)#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
index c82a10655c33..70831c24afe8 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant.explain
@@ -1,2 +1,2 @@
-Project
[static_invoke(SchemaOfVariant.schemaOfVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true)))) AS schema_of_variant(parse_json(g))#0]
+Project
[static_invoke(SchemaOfVariant.schemaOfVariant(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true, true)))) AS schema_of_variant(parse_json(g))#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
index 3d894628ab7e..de06881d96d4 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_schema_of_variant_agg.explain
@@ -1,2 +1,2 @@
-Aggregate
[schema_of_variant_agg(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true)), 0, 0) AS schema_of_variant_agg(parse_json(g))#0]
+Aggregate
[schema_of_variant_agg(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true, true)), 0, 0) AS schema_of_variant_agg(parse_json(g))#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
index 7a0c0078128f..efbfc467ba4b 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_parse_json.explain
@@ -1,2 +1,2 @@
-Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false,
false)) AS try_parse_json(g)#0]
+Project [static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false, false,
true)) AS try_parse_json(g)#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
index 65f527da78c4..68eace9d8322 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_try_variant_get.explain
@@ -1,2 +1,2 @@
-Project
[try_variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false,
true)), $, IntegerType, false, Some(America/Los_Angeles)) AS
try_variant_get(parse_json(g), $)#0]
+Project
[try_variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0, false,
true, true)), $, IntegerType, false, Some(America/Los_Angeles)) AS
try_variant_get(parse_json(g), $)#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
index 33c6b3a52f52..c13d45774492 100644
---
a/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
+++
b/sql/connect/common/src/test/resources/query-tests/explain-results/function_variant_get.explain
@@ -1,2 +1,2 @@
-Project [variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true)), $, IntegerType, true, Some(America/Los_Angeles)) AS
variant_get(parse_json(g), $)#0]
+Project [variant_get(static_invoke(VariantExpressionEvalUtils.parseJson(g#0,
false, true, true)), $, IntegerType, true, Some(America/Los_Angeles)) AS
variant_get(parse_json(g), $)#0]
+- LocalRelation <empty>, [id#0L, a#0, b#0, d#0, e#0, f#0, g#0]
diff --git
a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
index 127c9218d4b7..2d26356890d2 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
@@ -185,6 +185,39 @@ class VariantEndToEndSuite extends SharedSparkSession {
checkAnswer(variantDF, Seq(Row(expected)))
}
+ test("SPARK-56654: parse_json/from_json reject unpaired UTF-16 surrogates by
default") {
+ val invalidJson = "\"\\uD835\""
+ val df = Seq(invalidJson).toDF("j")
+ checkAnswer(df.selectExpr("try_parse_json(j)"), Seq(Row(null)))
+ checkAnswer(df.selectExpr("from_json(j, 'variant')"), Seq(Row(null)))
+ val parseJsonError = intercept[SparkException] {
+ df.selectExpr("parse_json(j)").collect()
+ }
+ checkError(
+ exception = parseJsonError,
+ condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+ parameters = Map("badRecord" -> invalidJson, "failFastMode" ->
"FAILFAST")
+ )
+
+ val fromJsonFailFast = intercept[SparkException] {
+ df.selectExpr("from_json(j, 'variant', map('mode',
'FAILFAST'))").collect()
+ }
+ checkError(
+ exception = fromJsonFailFast,
+ condition = "MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION",
+ parameters = Map("badRecord" -> "[null]", "failFastMode" -> "FAILFAST")
+ )
+
+ withSQLConf(SQLConf.VARIANT_VALIDATE_UNICODE_IN_JSON_PARSING.key ->
"false") {
+ val parsed = df.selectExpr("parse_json(j)").collect()
+ assert(parsed.length == 1 && parsed.head.get(0) != null,
+ "legacy mode should accept unpaired surrogates")
+ val tryParsed = df.selectExpr("try_parse_json(j)").collect()
+ assert(tryParsed.length == 1 && tryParsed.head.get(0) != null,
+ "legacy mode should accept unpaired surrogates via try_parse_json")
+ }
+ }
+
test("to_variant_object - Codegen Support") {
Seq("CODEGEN_ONLY", "NO_CODEGEN").foreach { codegenMode =>
withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key -> codegenMode) {
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]